Last modified: 2006/08/03
Program: PolyPhred All rights reserved. This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington. This software is provided "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
This version contains a new method to detect and genotype diallelic indels (activated by the -indel flag). The SNP detection method used by this version is the same as the one in PolyPhred version 5.04. The indel detection method focuses on accurately identifying the traces containing heterozygous indels. In short, it first identifies such traces and determines the length and the location of the indel in each of them; it then combines the results across the traces that align at a locus to find indel sites in the sample population; and finally it determines the genotypes of the individuals in the sample at each site. For every site, it computes and reports a score that reflects the strength of the evidence for an indel at that site. Similarly, it reports a genotype score for every individual (or trace) at the site, which summarizes the confidence that should be placed in the genotype call.
Based on our tests, the method identifies indel lengths very accurately (~96% of the time). The determined indel location may, however, be a few basepairs upstream or downstream of the location a human expert might report. Therefore, a manual tagging system is provided for marking the correct locations of indels. The corrected locations and lengths will be reported in the PolyPhred output (see User-defined manual tags).
PolyPhred will mark an indel site on the consensus sequence with an 'indelSite' tag and the traces containing heterozygous indels with a 'heterozygoteIndel' tag. The remaining traces at the site will be marked with a 'homozygoteIndel' tag. The site score, genotype score, and the indel length are reported in the tags (visible in Consed). Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, it is necessary to modify the .consedrc file. Add the following lines to the .consedrc file:
consed.customConsensusTag1: indelSite
consed.tagColorCustomConsensusTag1: DarkCyan
consed.customTag1: indel
consed.tagColorCustomTag1: DarkOrange
If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.
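Picking the "next available number" can be sketched in Python as follows. This is only an illustration, not part of PolyPhred or Consed: the helper names are hypothetical, and it assumes that the same number is used for both the consensus tag and the read tag and that the smallest unused number is acceptable.

```python
import re
from typing import List

# Matches lines such as "consed.customTag2: ..." or "consed.customConsensusTag1: ..."
_CUSTOM_TAG = re.compile(r"^\s*consed\.custom(?:Consensus)?Tag(\d+)\s*:", re.MULTILINE)


def next_free_tag_number(consedrc_text: str) -> int:
    """Return the smallest N >= 1 not already used by a custom tag entry."""
    used = {int(m.group(1)) for m in _CUSTOM_TAG.finditer(consedrc_text)}
    n = 1
    while n in used:
        n += 1
    return n


def indel_tag_lines(n: int) -> List[str]:
    """Build the four .consedrc lines from the text above, using number n."""
    return [
        f"consed.customConsensusTag{n}: indelSite",
        f"consed.tagColorCustomConsensusTag{n}: DarkCyan",
        f"consed.customTag{n}: indel",
        f"consed.tagColorCustomTag{n}: DarkOrange",
    ]
```

The output of indel_tag_lines can then be appended to the .consedrc file by hand or by a script.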
Identification of potential heterozygous sites is based on 1) the presence of two significant overlapping fluorescence peaks at such sites in the sequence trace, and 2) detecting a decrease of about 50% in the peak heights when the sequence trace is compared with that obtained from homozygous individuals (references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2). In addition, if double-stranded coverage of the reads is provided, the accuracy of the results is significantly increased.
PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes the programs Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).
During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence (see How PolyPhred scores SNP sites). It also uses a standard sequence for comparison to identify sites that are homozygous for a minor or alternative allele. The score indicates how well the trace at the site matches the expected pattern for a SNP. After PolyPhred identifies the putative polymorphic sites, it updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using the program Consed. PolyPhred also generates a detailed output that lists the positions, genotypes and scores of the polymorphic sites in a format that can be easily parsed into a database program.
Part of the process of computing the score involves averaging certain values across the reads in the assembly. For small assemblies, the accuracy of these averages increases with the number of overlapping sequences. This in turn increases the reliability of the results. We recommend that the region of interest be covered by at least eight independent sequences, if possible.
A significant increase in the rate of true-positive SNPs can be achieved by sequencing each sample in both directions. PolyPhred combines double-stranded information to enhance the accuracy of its genotype calls. To take advantage of this feature, it is necessary to use a sensible naming convention when naming the sequence data files. The sequence file names should contain a contiguous set of characters that identify the individual source. Using the -source flag (see below), PolyPhred can then match sequences that are from the same source (see also Reducing the false-positive rate).
-block -SAMPLE -COVERAGE
- If omitted: all blocks except MICROSATELLITE are included in the output.
-dir (-d) [work directory]
Use this flag to specify the location of the data. The flag allows PolyPhred to be run
from a directory other than the one containing the data to be analyzed (see
Running PolyPhred).
- If omitted: PolyPhred must be run either from the edit_dir directory or from the data
directory of the data to be analyzed.
-flanking (-f) [number]
Use this flag to specify the number of bases flanking the polymorphic sites reported in the POLY
and POLYINDEL blocks of the PolyPhred output.
- Accepted numbers: 0 - 50
- If omitted: 10
-group (-g) [regular expression]
This flag specifies a subset of the files to be used in the analysis. PolyPhred analyzes only
those sequences with a name that matches the regular expression.
- If omitted: .+
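The effect of a -group pattern can be previewed with a short Python sketch. It assumes that a name "matches" when the pattern is found anywhere in it and that Python's regular-expression syntax is close enough to PolyPhred's for simple patterns; neither is confirmed by this document, and the file names below are made up.

```python
import re


def select_reads(names, group_regex=".+"):
    """Return the names that match the pattern (unanchored search, an assumption)."""
    pattern = re.compile(group_regex)
    return [name for name in names if pattern.search(name)]


# Hypothetical chromat file names, for illustration only.
reads = ["geneA_ind01_F.scf", "geneA_ind02_R.scf", "geneB_ind01_F.scf"]
print(select_reads(reads, "^geneA"))
```

With the default pattern .+ every non-empty name is selected, which agrees with the "If omitted" behavior described above.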
-help (-h)
Use this flag to see information on how to use PolyPhred. The flags are listed along with their
allowed and default values.
- If omitted: normal operation
-idat
This is an optional flag. If included, the method writes the results of trace-by-trace analysis by the indel algorithm (see -indel flag, below) to the standard output for debugging and
computing the detection accuracy. To avoid mixing this output with PolyPhred's output report, use the -o flag to specify a filename for
PolyPhred's output report.
-indel (-i) [number]
This flag instructs PolyPhred to run the indel detection algorithm. The
flag can be optionally followed by an integer from 1 to 30, which
specifies the highest value of the length of indel the method searches
for. For example, to search for indels of length up to 15, use: -i 15.
The computational time is proportional to the value of this integer and
the method requires approximately 4 hours to analyze a 30Kb gene
sequenced across 47 individuals when searching for indels of length up
to 30. In our datasets, ~85% of indels are <5bp, ~95% are
<13bp and ~99% are <31bp in length. The computational
time can be considerably shortened by using a smaller value of the
indel length supplied to this flag. The method uses the basecalls of
the reads containing heterozygous indels in the data. Therefore, if the
basecalls have been removed manually or otherwise (e.g. converted to
'N') in order to prevent PolyPhred from reporting sites in these reads
as SNPs, the indel detection method will not function properly. The -s
flag to activate the source genotype
resolution function is also applicable and recommended for this
algorithm. The method, by default, does not report indels that follow
poly tracks with 8 or more repeats (as these can be indel errors during
PCR amplification). This default can however be changed using the -md
flag (see below).
The following flags are functional only if -i flag is used: -inav, -iscore, -md,
-idat.
-inav [on / off / filename]
This is an optional flag that can be used to create a "navigation" file.
Using a navigation file is a convenient and quicker way to confirm the
indels identified by the method when using Consed. We highly recommend
using the navigation file to browse and confirm the indels found by the
method. The flag can be followed by on / off / filename. If no filename
is specified (e.g., -inav on), it creates a file called indel.nav in
the edit_dir of the gene. To use this file, from the main window of
Consed select Navigate -> Custom navigate -> filename. This
opens a popup window. Each entry in the window corresponds to an indel
site and displays the
Contig Name, name of the highest scoring heterozygous read at the site,
the consensus position of the indel, length of the indel, score
assigned to the heterozygous genotype of that read, and score assigned
to the indel site. Double-clicking on an entry in this list will focus
the cursor on the "best" heterozygous indel read at that site in the
Aligned Reads Window of Consed. The corresponding trace can then be
visualized by middle-clicking on the read.
-iscore [number]
This is an optional flag that has to be followed by an integer value (from 0
to 99). This value specifies the score cutoff for reporting indel
sites. For example, -iscore 80 will only report sites that have a score
of at least 80. If the flag is not specified the default value for
cutoff used is 80.
EXAMPLE:
polyphred -d <the_full_pathname_of_gene> -i 20 -inav on -iscore 85 -s 10 13 -o output_file
This command will search for indels of length up to 20, resolve genotypes across the sequences using characters 10 through 13 of a trace name to identify the individual, report those sites that score at least 85, and write the
indel.nav file in the directory <the_full_pathname_of_gene>/edit_dir
-md [a sequence of space-separated numbers]
This optional flag can be used to specify a definition of a microsatellite.
Indels that occur downstream of these will not be reported. To define a
microsatellite using this flag, specify a sequence (of length up to 8)
of integers. Each value is the minimum number of repeats of a unit
whose length equals the position of that value in the sequence.
For example, -md 8 5 4 4 4 4 defines a
microsatellite as: mononucleotide repeats with at least 8 repeats,
dinucleotides with at least 5 repeats, trinucleotides with at least 4
repeats, and so on. If this flag is not specified, the default used
is: -md 8 8 8 8 8 8 8 8. The operation of this flag is independent of other microsatellite related flags (such as -ms) used in SNP discovery and genotyping.
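The way the -md values map to repeat-unit lengths can be illustrated with a small Python sketch. It is not PolyPhred's implementation: the function names are hypothetical, and it assumes that unit lengths with no value supplied are simply not treated as microsatellites.

```python
def parse_md(values):
    """Map repeat-unit length (1-based position in the list) to the minimum repeat count."""
    if not 1 <= len(values) <= 8:
        raise ValueError("-md takes between 1 and 8 integers")
    return {unit_len: min_repeats for unit_len, min_repeats in enumerate(values, start=1)}


def is_microsatellite(unit, repeats, definition):
    """True if `unit` repeated `repeats` times meets the definition.

    Unit lengths absent from the definition are treated as non-microsatellites
    (an assumption made for this sketch).
    """
    min_repeats = definition.get(len(unit))
    return min_repeats is not None and repeats >= min_repeats


definition = parse_md([8, 5, 4, 4, 4, 4])
print(definition[1], definition[2], definition[3])  # minimum repeats for 1-, 2- and 3-base units
```

With this definition, a CA dinucleotide repeated 4 times does not qualify, while a CAG trinucleotide repeated 4 times does.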
-ms [x / on / off]
Use this flag to switch on or off the marking of simple microsatellite repeats. If the argument 'x'
is passed, putative SNP sites that are found within microsatellites are given a score equal to the
score limit (see -score).
- Default argument: on
- If omitted: off
-nav (-n) [file name / on / off]
Use this flag to generate a navigation file listing the polymorphic sites. If
the file name is given but does not have a final ".nav" extension, PolyPhred adds
one. The file is written to the edit_dir directory of the working directory.
- Default argument: on, using the file name "polyphred.nav"
- If omitted: off
To use the navigation file in Consed, click on 'Navigate', located at the top of the 'Consed Main Window'. Then click on 'Custom Navigation'. The window that appears should contain the name of the navigation file. Click on the file name to bring up the navigation window.
-output (-o) [file name / on / off]
Use this flag to send the PolyPhred output either to a file or to the standard output (the screen).
If the argument is "off", the output is written to the screen. In this case, the output can be
redirected to a file using '>'.
- Default argument: on, using the file name "polyphred.out"
- If omitted: off
-quality (-q) [value]
Use this flag to set the quality threshold. PolyPhred uses the quality threshold to determine the extent of the
excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the
assembly is viewed in Consed). Reducing this value results in less trimming of the ends. See
Reducing the false-positive rate.
- Accepted value: 0 - 50
- If omitted: 25
-rank (-r) [value / on / off]
Use this flag to direct PolyPhred to score sites with the six-point ranking system. To set the rank
threshold, follow the flag with a number from 1 to 6. PolyPhred marks and reports only sites that are
assigned a rank between 1 and the rank threshold, inclusive. See
Reducing the false-positive rate.
- Accepted value: 1 - 6
- Default argument: on, using the value 3
- If omitted: the 100-point scoring system is used.
-ref [reference sequence identifier / on / off]
Use this flag to specify a reference sequence for reporting of polymorphic site positions. In this case,
PolyPhred uses the consensus sequence as the standard, rather than the reference sequence (see -refcomp below).
See Using a reference sequence.
- Default argument: on, using the identifier ".REF"
- If omitted: off
-refcomp [reference sequence identifier / on / off]
Use this flag to direct PolyPhred to use a reference sequence as the standard rather than the consensus
sequence. See Using a reference sequence.
- Default argument: on, using the identifier ".REF"
- If omitted: off
-source (-s) [/delimiter / posn1 posn2 / off]
Use this flag to activate the source genotype resolution function and set the location in the chromat file
names of the source identifier, or turn the function off. The source identifier is a contiguous set of characters
that uniquely identifies the source of the DNA sample. PolyPhred uses the source identifier to match
sequences from the same DNA sample. See Reducing the false-positive rate.
The source identifier can be placed in the chromat file names in either of two methods. One method is to flank the identifier characters with a delimiter. Any valid file name character can serve as the delimiter. When running PolyPhred, indicate the delimiter as follows ('c' is the delimiter character):
polyphred -s /c
For example, if the chromat file names are of the form: abc-source-xyz.scf
polyphred -s /-
The second method for locating the source identifier is to place the identifier characters in a constant location in all chromat file names. Indicate the location of the identifier characters as follows:
polyphred -s posn1 posn2
For example, if all chromat file names are of the form: abcSOURCExyz.scf
polyphred -s 4 9
If the function has been activated in the .polyphredrc file, it can be turned off with the 'off' argument.
- If omitted: off
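Before settling on a naming convention, it can help to check which identifier each method would pick out of a set of file names. The Python sketch below mimics the two methods under stated assumptions: positions are 1-based and inclusive (consistent with the abcSOURCExyz.scf / -s 4 9 example), and with a delimiter the identifier is the text between the first two occurrences of that delimiter. The function names are hypothetical and this is not PolyPhred's own code.

```python
from collections import defaultdict


def source_by_delimiter(name, delimiter):
    """Return the text between the first two occurrences of the delimiter."""
    parts = name.split(delimiter)
    if len(parts) < 3:
        raise ValueError(f"{name!r} does not contain two {delimiter!r} delimiters")
    return parts[1]


def source_by_position(name, posn1, posn2):
    """Return characters posn1..posn2 of the name (1-based, inclusive)."""
    return name[posn1 - 1:posn2]


def group_by_source(names, extract):
    """Group file names by the source identifier returned by `extract`."""
    groups = defaultdict(list)
    for name in names:
        groups[extract(name)].append(name)
    return dict(groups)


print(source_by_delimiter("abc-source-xyz.scf", "-"))
print(source_by_position("abcSOURCExyz.scf", 4, 9))
```

Reads that fall into the same group are the ones PolyPhred would be expected to treat as coming from the same DNA sample.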
-score [number]
Use this flag to select the 100-point scoring system and set the score threshold. PolyPhred marks and reports
only sites that are assigned a score between 99 and the score threshold, inclusive.
See Reducing the false-positive rate.
- Accepted numbers: 0 - 99
- If omitted or argument omitted: the 100-point scoring system is used with a score threshold of 70
-snp [het / hom / on / off]
Use this flag to switch on or off SNP detection, or to select either the marking of heterozygous (het)
or homozygous (hom) polymorphisms only.
- Default argument: on, marking both heterozygous and homozygous polymorphisms
- If omitted: on
-tag (-t) [tag type]
Use this flag to specify the tag type used to mark SNP sites for viewing in Consed. The three tag types
are "genotype", "polymorphism", and "rank". The tag types can be abbreviated as g, p and r, respectively.
Using the genotype tag results in putative polymorphic sites marked on the consensus sequence with
color-coded tags indicating rank, and putative SNPs marked with pink tags on the sample sequences.
Using the rank tag results in color-coded tags indicating rank placed on both the consensus sequence
and the sample sequences (see How PolyPhred scores SNP sites for the color codes.)
Using the polymorphism tag results in a blue tag placed on all putative polymorphic sites on the
consensus sequence and pink tags indicating putative SNPs on the sample sequences.
- If omitted: genotype
-update [on / off]
Use this flag to control updating of the ACE and PHD files. If updating is switched off, the ACE
and PHD files are not updated, and the PolyPhred results can not be viewed in Consed.
- Default argument: on
- If omitted: on
-verbosity (-v) [0 / 1 / 2]
Use this flag to set the level of status reporting that will be written to the screen
as PolyPhred is running. The allowed arguments range from 0 (least reporting) to 2 (most
reporting).
- If omitted: 0
-version
Use this flag to see the PolyPhred version and build number.
- If omitted: normal operation
-window (-w) [number]
Use this flag to set the window width. PolyPhred uses the window width, together with the quality threshold,
to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the
regions shaded in yellow when the assembly is viewed in Consed).
- Accepted numbers: 5 - 50
- If omitted: 20
-xml [on / off]
Use this flag to specify the format of the PolyPhred output.
- Default argument: on
- If omitted: off
If the -source flag is used, PolyPhred combines the information in matched reads to increase the accuracy of its genotype calls. Scores for genotypes that are in agreement are increased (see Reducing the false-positive rate).
When all sites at a position (i.e., a column as viewed in Consed) have been assigned a score, PolyPhred calculates an overall score and genotype for the position. This score depends on the highest-scoring site in the sample sequences. If the overall score is greater than or equal to the score threshold (see the -score flag), then PolyPhred marks the position as polymorphic. The number of sites that PolyPhred marks can be controlled by adjusting the score threshold (see Reducing the false-positive rate).
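The thresholding step can be pictured with a deliberately simplified Python sketch. The text above only says that the overall score depends on the highest-scoring site; the sketch assumes the simplest case, in which the overall score equals that maximum, and uses the default score threshold of 70 given under the -score flag. The function name is hypothetical.

```python
def position_call(site_scores, threshold=70):
    """Return (overall_score, is_polymorphic) for one assembly column.

    Simplifying assumption: the overall score is the maximum site score.
    """
    overall = max(site_scores)
    return overall, overall >= threshold


print(position_call([12, 45, 88]))
print(position_call([12, 45]))
```

Raising the threshold marks fewer positions, and lowering it marks more, which is the trade-off discussed in Reducing the false-positive rate.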
If the six-point ranking system is selected, PolyPhred converts the score to a rank according to the table below. Along with each rank is the color of the tags as displayed in Consed.
The 'True Positive Rate' column shows the percentage of true positive SNPs marked within each rank, as found in our own analysis, using the default -score and -quality settings. These results may vary depending on changes in these settings, as well as the quality of the data and number of samples analyzed.
Score | Rank | Tag Color | True Positive Rate |
99 | 1 | red | 97% |
95-98 | 2 | orange | 75% |
90-94 | 3 | green | 62% |
70-89 | 4 | dark blue | 35% |
50-69 | 5 | magenta | 11% |
0-49 | 6 | purple | 1% |
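The score-to-rank conversion in the table can be written as a short Python lookup. The lower bounds and colors are taken from the table; the function and constant names are hypothetical, and the sketch is not PolyPhred's own code.

```python
# Lowest score that still earns ranks 1 to 5, read off the table above.
DEFAULT_RANK_FLOORS = (99, 95, 90, 70, 50)
RANK_COLORS = {1: "red", 2: "orange", 3: "green", 4: "dark blue", 5: "magenta", 6: "purple"}


def score_to_rank(score, floors=DEFAULT_RANK_FLOORS):
    """Convert a score from 0 to 99 to a rank from 1 to len(floors) + 1."""
    if not 0 <= score <= 99:
        raise ValueError("score must be between 0 and 99")
    for rank, floor in enumerate(floors, start=1):
        if score >= floor:
            return rank
    return len(floors) + 1


for s in (99, 96, 90, 75, 50, 10):
    rank = score_to_rank(s)
    print(s, rank, RANK_COLORS[rank])
```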
The POLY block
In this block, the putative SNP sites identified by PolyPhred are listed, as well as sites
marked by columntag type tags (see User-defined manual tags). Each line
reports the consensus sequence position, the 5' sequence flanking the polymorphic site, the
two most common alleles at the site, the 3' sequence flanking the site, and the over-all score
assigned to the site.
- XML tag: block-snp_site subtag: snp_site
The GENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each putative
SNP site in the POLY block. Each line reports the consensus sequence position, the position relative
to the sample sequence, the name of the sample sequence, the two alleles at the position, and the
score.
- XML tag: block-snp_genotype subtag: snp_genotype
The COLUMNGENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each manual-SNP tag
applied to the consensus sequence. Each line reports the consensus sequence position, the position
relative to the sample sequence, the name of the sample sequence, the two alleles at the position,
and the score. PolyPhred obtains the user-defined tags from the .polyphredrc file
(see User-defined manual tags).
- XML tag: block-manual_snp subtag: snp_genotype
The COLUMNINDEL block
In this block, the genotypes of the individual sample sequences are listed for each manual-indel tag. Each line reports the consensus sequence position, the position
relative to the sample sequence, the name of the sample sequence, and the genotype. The tag
used to specify the genotype can be user-defined in the .polyphredrc file
(see User-defined manual tags).
- XML tag: block-manual_indel subtag: manual_indel
- This is a new block.
The MANUALGENOTYPE block
In this block, sample sequence sites that have been tagged manually are listed. Each line reports
the consensus sequence position of a tagged site, the position relative to the sample sequence
that was tagged, the identity of the tag, and the comment if one is present.
PolyPhred obtains the user-defined tags from the .polyphredrc file (see
User-defined manual tags).
- XML tag: block-manual_genotype subtag: manual_genotype
The VERIFIED block
In this block, sites manually tagged as verified are listed. Each line reports the consensus
sequence position and the tag identity. PolyPhred obtains the user-defined tags from the
.polyphredrc file (see User-defined manual tags).
- XML tag: block-verified_site subtag: verified_site
The MICROSATELLITE block
If the -ms flag is set to 'on', this block lists the microsatellite sequences that were found.
Each line reports the consensus sequence position of the 5' end of the microsatellite and the
repeat pattern.
- XML tag: block-microsatellite subtag: microsatellite
- This is a new block.
The SAMPLE block
The names of the sample sequences that were analyzed and their sequence qualities are
listed in this block. Each line reports the name of a sequence, the positions of the left
and right boundaries of the search region (between the trimmed ends), and the average site
quality, as determined by Phred, within the search region.
- XML tag: block-sample_quality subtag: sample_quality
The COVERAGE block
This block provides a tally of the number of sample sequences that PolyPhred examined
at each position. Each line reports the begin and end positions of a range relative
to the consensus sequence, followed by the number of sample sequences that were
analyzed in the range.
- XML tag: block-coverage subtag: coverage
INDEL blocks new to version 6.0
Running PolyPhred with -i flag adds two blocks to the output report of PolyPhred: INDELPOLY block which contains the information about indel sites, and INDELGENOTYPE block which contains information about the genotypes at these sites.
The INDELPOLY block
This block reports:
1. the consensus position of the indel site;
2. the smallest value among the consensus positions of indels found in the heterozygotes at the site (the indel positions determined by the method may differ between the heterozygous reads at a given indel site);
3. the largest value among the consensus positions of indels found in the heterozygotes at the site;
4. the length of the indel; and
5. the score assigned to the site.
The INDELGENOTYPE block
This block reports:
1. the consensus position of the indel site;
2. the consensus position of the indel found in the read (if the genotype is not heterozygous, this value is the same as 1);
3. the length of the indel found in the read (for homozygotes, this value is 0);
4. the name of the read;
5. the genotype score of the read; and
6. the genotype.
If the -idat flag is used, the above two blocks report additional information:
The INDELPOLY block
Two more columns are added to the original 5 columns. The 6th
column reports the log-likelihood-ratio score for the site and the 7th
column reports the location of a microsatellite found upstream of the
site (-1 if no microsatellite found).
The INDELGENOTYPE block
Two columns are inserted between the 5th and the 6th columns:
the first column reports log-likelihood ratio score for the genotype
and the second column contains the location of a microsatellite found
upstream of the site (-1 if no microsatellite found).
For PolyPhred to recognize the tags, they must be listed in the .polyphredrc file (see Customizing PolyPhred). Once the .polyphredrc file has been set up, the typical procedure is to 1) assemble the data, 2) run PolyPhred, 3) use Consed to analyze the results, mark sites and make changes, and 4) run PolyPhred again to obtain both the PolyPhred- and user-generated information in the output file.
The four tag types are:
The manualtag type
Tags of this type are used to mark or edit a site in a sample sequence. Typically these tags are
used to change the genotype call made by Phred or PolyPhred. Sites marked with these tags are
listed in the MANUALGENOTYPE block.
The verifiedtag type
This tag type is applied to the consensus sequence to indicate that a polymorphic site is
verified. Sites marked with these tags are listed in the VERIFIED block.
The columntag type
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide
SNP genotypes for all of the sample sequences at the tagged sites. Sites marked by these tags are
listed in the POLY block. The genotypes in the sample sequences are listed in the COLUMNGENOTYPE block.
The columnindeltag type
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide
indel genotypes for all of the sample sequences at the tagged sites. The tags can be used to
mark the positions and define the length of indel sites. The tag should "cover" the segment
involved in the indel so that PolyPhred can report the indel segment in the output. Sites marked
by these tags are listed in the POLYINDEL block, and the genotypes in the sample sequences
are listed in the COLUMNINDEL block. The name of the tag that marks the site will be used to indicate
the homozygous genotype. The heterozygous genotype can be set in the .polyphredrc file with the
'indelhettag' key-word. If this is not set, PolyPhred will indicate heterozygotes with the label
'heterozygoteIndel'.
indelSite
This tag is added to the consensus sequence. Additional information included in the tag is:
the consensus location of the indel, the score of the site and the length of the indel.
heterozygoteIndel
This tag is added to the heterozygous genotypes at the site. Additional information
included with this tag is: the consensus position, genotype, genotype
score and the length of the indel in the read.
homozygoteIndel
This tag is added to the homozygous genotypes at the site.
phred version 0.961028 or later
phrap version 0.960731 or later
phd2fasta version 0.971024 or later
consed version 13.0 or later
polyphred        the PolyPhred program
polygen          tool for making PHD and POLY files from ABI chromat files
sudophred        tool for making chromat, PHD and POLY files from FASTA files
polyphred.html   this document
phredPhrap       perl script for running phred and phrap together in the correct order
/usr/local/genome/bin/
# $polyPhredExe = "/usr/local/genome/bin/polyphred";
$bUsingPolyPhred = 0;
Read the section Customizing PolyPhred, as well as the section Detection of insertion/deletion polymorphisms for instructions on customizing Consed.
mkdir mydata
Within this directory, create the four subdirectories as follows:
cd mydata
mkdir chromat_dir edit_dir phd_dir poly_dir
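When many genes are set up at once, the same layout can be created from a script. The Python sketch below is one way to do it; the function name is hypothetical and "mydata" is only the example directory name used above.

```python
import os

# The four subdirectories named in the instructions above.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")


def make_project_dirs(root):
    """Create root and its four data subdirectories; return the subdirectory paths."""
    paths = []
    for name in SUBDIRS:
        path = os.path.join(root, name)
        os.makedirs(path, exist_ok=True)  # no error if the directory already exists
        paths.append(path)
    return paths
```

Calling make_project_dirs("mydata") from the parent directory gives the same result as the mkdir commands shown above.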
If a reference sequence is to be included in the assembly, use the sudophred tool to generate fake chromat, PHD and POLY files.
To assemble the data, cd to the edit_dir directory and run "phredPhrap mydata". The program phredPhrap automatically runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir, including one with the extension .ace.1 (the ACE file), and several files in the phd_dir and poly_dir directories.
View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.
Run "polyphred". Include any desired flags on the command line.
Use Consed to view the polymorphic sites (see Customizing PolyPhred).
polygen
polygen -d ~/my_home_dir/gene_data
To force polygen to overwrite any existing PHD and POLY files, use the -overwrite (-o) flag.
Run "polygen -h" or "polygen -help" to show a list of the options.
Run "polygen -v" or "polygen -version" to show the version.
The sudophred program is a tool that can be used to generate fake chromat, PHD and POLY files from DNA sequences in FASTA format. Fake chromat and PHD files are needed if one wishes to include a reference sequence in the assembly of the data set (see Using a reference sequence). Also, if one wants to compare data from sequence trace (chromat) files with text sequences, the text sequences need to be converted into all three file types.
The sudophred program takes one text file as input. The text file can contain one or more sequences in FASTA format. If one is generating fake data files for a reference sequence, sudophred writes data files for the first sequence only. Otherwise, sudophred will generate data files for each of the sequences in the text file. In either case, the names of the data files are taken from the string that follows the '>' at the beginning of each sequence.
One way to run sudophred is to put the FASTA file in an edit_dir directory. Sudophred will write each file that it generates into the appropriate directory. That is, sudophred writes the chromat file in the chromat_dir directory, the phd file in phd_dir, and the poly file in poly_dir. One can also put the FASTA file in an arbitrary directory and run sudophred from there. In this case, sudophred will write all of the files into that same directory. The files must then be moved to the appropriate data subdirectories. In either case, it is easiest to generate the fake data files before running the phredPhrap program that assembles the data into contigs.
By default, sudophred writes all three files. The chromat files are written in SCF format. In the phd files, the quality values are all 59.
To run sudophred, enter:
sudophred [filename]
where filename is the name of the text file containing the sequences. The file name must always be the first argument.
To use sudophred to generate files for a reference sequence, use the -r flag. This flag can be followed by a string that PolyPhred will use to identify the reference sequence. For example:
sudophred [filename] -r .XYZ
If no string is supplied, sudophred will use the default string .REF
To change the quality threshold, use the -q flag followed by the value (an integer from 0 to 59). For example:
sudophred [filename] -q 20
To write the chromat files in ABI format, use the -abi flag.
sudophred [filename] -abi
Run "sudophred -h" or "sudophred -help" to show a list of the options.
Run "sudophred -v" or "sudophred -version" to show the version.
To include a reference sequence in the assembly, one should first create the necessary data files from the reference sequence. These files can be generated with the sudophred program supplied with PolyPhred (see Using the sudophred tool).
Use sudophred with the -r flag to generate the reference sequence data files. For example,
sudophred [filename] -r
where filename is the name of the text file containing the reference sequence in FASTA format. The data files will be given names that begin with the string that follows the '>' at the beginning of the sequence, followed by the default reference identifier ".REF". In this case, one would run PolyPhred with the reference options as follows:
polyphred -ref
To specify a different reference identifier, follow the -r flag with the identifier string. For example, to set the reference identifier as "xYZ", run:
sudophred [filename] -r xYZ
In this case, the data files will contain the string "xYZ" in the file names, rather than ".REF", and it will be necessary to select the reference option as follows:
polyphred -ref xYZ
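The naming rule described above can be sketched in Python. This is an illustration only: the helper name reference_file_stem is hypothetical, and it assumes that the sequence name is the first whitespace-delimited word after the '>' and that the reference identifier is appended directly to it. The text above does not spell out either detail.

```python
def reference_file_stem(fasta_header, ref_id=".REF"):
    """Return the assumed common stem of the data file names for a reference sequence."""
    if not fasta_header.startswith(">"):
        raise ValueError("a FASTA header line must start with '>'")
    # Assumption: only the first word after '>' is used as the sequence name.
    words = fasta_header[1:].split()
    if not words:
        raise ValueError("the FASTA header has no sequence name")
    # Assumption: the reference identifier follows the name with no separator.
    return words[0] + ref_id
```

The identifier passed to sudophred with -r must match the one given to "polyphred -ref", which is why the sketch takes it as a single parameter.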
setenv POLYPHRED_PATH [path]
where [path] is the directory containing the .polyphredrc file.
flag -score 80
flag -q 30
flag -output out.txt
changes two defaults: it sets the name of the output file to 'out.txt' and causes PolyPhred to write the output to a file with that name rather than to the screen. To change the default file name but keep output to the screen as the default behavior, use the 'outputfile' key-word, as:
outputfile out.txt
Then, to use the new default output file name, run "polyphred -o on".
flag -ref [identifier]
flag -refcomp [identifier]
refID [identifier]
ranks
The 'ranks' key-word allows the user to change the values used to convert scores to ranks
(see How PolyPhred scores SNP sites). For example, the following line:
ranks 90 80 60 40 20
will result in these conversions:
Probability | Rank |
99-90 | 1 |
89-80 | 2 |
79-60 | 3 |
59-40 | 4 |
39-20 | 5 |
19-0 | 6 |
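The conversion in the table above can be sketched in Python. The function name score_to_rank is hypothetical. The sketch assumes that each value on the 'ranks' line is an inclusive lower bound for its rank, which is how the table reads, and that any score below the last value gets the final rank.

```python
def score_to_rank(score, thresholds=(90, 80, 60, 40, 20)):
    """Map a score to a rank using descending thresholds (illustrative sketch)."""
    # The first threshold that the score reaches determines the rank.
    for rank, lower_bound in enumerate(thresholds, start=1):
        if score >= lower_bound:
            return rank
    # Scores below every threshold fall into the last rank.
    return len(thresholds) + 1
```

With the thresholds from the example line, a score of 85 falls in the 89-80 band and is converted to rank 2.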
acedir, phddir, polydir
The 'acedir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to
directories other than the ones required by Phred, Phrap and Consed. The 'acedir' key-word sets the location of
the ace file (which is normally in the edit_dir directory). The 'phddir' and 'polydir' key-words specify
the locations of the phd and poly files, respectively. A directory is considered to be within the
work directory, unless an absolute path is given (starts with a '/'). Use a '.' to indicate that a directory
is the same as the work directory.
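The directory rule above can be illustrated with a short Python sketch. The function name resolve_data_dir and the example paths are hypothetical. The sketch only mirrors the rule as stated (absolute paths start with '/', everything else is relative to the work directory) and makes no claim about how PolyPhred itself resolves paths.

```python
import posixpath

def resolve_data_dir(work_dir, setting):
    """Resolve an acedir, phddir or polydir value against the work directory."""
    if setting.startswith("/"):
        # An absolute path is used as given.
        return posixpath.normpath(setting)
    # Anything else, including '.', is taken relative to the work directory.
    return posixpath.normpath(posixpath.join(work_dir, setting))
```

For a work directory of /proj/run1, the value "phd_dir" resolves to /proj/run1/phd_dir, "." resolves to /proj/run1, and "/data/poly" is left unchanged.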
date
The 'date' key-word allows the user to set the format of the date that appears at the top of the output
file. The key-word must be followed by one of six format codes:
2-digit year | 4-digit year | format |
DMY | DMYY | day/month/year |
MDY | MDYY | month/day/year |
YMD | YYMD | year/month/day |
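The six format codes can be related to date patterns with a short Python sketch. The names DATE_FORMATS and format_report_date are hypothetical, and the zero-padded fields and '/' separators are assumptions based on the "day/month/year" notation in the table.

```python
from datetime import date

# Assumed strftime patterns for the six format codes.
DATE_FORMATS = {
    "DMY": "%d/%m/%y", "DMYY": "%d/%m/%Y",
    "MDY": "%m/%d/%y", "MDYY": "%m/%d/%Y",
    "YMD": "%y/%m/%d", "YYMD": "%Y/%m/%d",
}

def format_report_date(day, code):
    """Format a date according to one of the six 'date' key-word codes."""
    return day.strftime(DATE_FORMATS[code])
```

Under these assumptions, 3 August 2006 is written as 03/08/06 with DMY and as 2006/08/03 with YYMD.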
verifiedtag, columntag, indelhettag, manualtag
Four of the key-words set tag names for the four tag types (see
User-defined manual tags). Each tag type can have more than one name (see
the example .polyphredrc file below). In addition, the indelhettag key-word allows the user
to specify the tag that will be used to indicate heterozygous indels.
Here is an example of a .polyphredrc file:
date YYMD
flag -q 25
flag -f 16
outputfile report.txt
refID .refSeq
# Manual Tags
verifiedtag polymorphism
columntag manualGenotype
columnindeltag indel:++
columnindeltag indel:--
indelhettag indel:+-
manualtag heterozygote
manualtag homozygote
manualtag indel
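The layout of the example file can be made concrete with a small Python reader. This is a sketch, not PolyPhred's own parser: the function name parse_polyphredrc is hypothetical, and it assumes one setting per line, a key-word followed by its value, and '#' marking a comment line, as the example suggests. Repeated key-words such as 'flag' and 'manualtag' are collected in order.

```python
def parse_polyphredrc(text):
    """Collect key-word settings from .polyphredrc-style text (illustrative sketch)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        # A key-word may appear several times, so keep every value.
        settings.setdefault(keyword, []).append(value)
    return settings

EXAMPLE = """\
date YYMD
flag -q 25
flag -f 16
outputfile report.txt
refID .refSeq
# Manual Tags
verifiedtag polymorphism
columntag manualGenotype
columnindeltag indel:++
columnindeltag indel:--
indelhettag indel:+-
manualtag heterozygote
manualtag homozygote
manualtag indel
"""

settings = parse_polyphredrc(EXAMPLE)
```

Here settings["flag"] holds both flag lines and settings["manualtag"] holds the three manual tag names, which shows how a tag type can have more than one name.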
As with PolyPhred, the flag key-word can be used to set one of polygen's flags.
The 'chromatdir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones required by Phred, Phrap and Consed. The 'chromatdir' key-word sets the location of the chromat files. The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. A directory is considered to be within the work directory, unless an absolute path is given (starts with a '/'). Use a '.' to indicate that a directory is the same as the work directory.
If you have questions or problems with Phred, Phrap or Consed, or you need to obtain
these programs, please see the web site at:
http://www.phrap.org
If you have questions or problems with PolyPhred, please
Follow the "PolyPhred" link for the email address of the person to contact. Please do not email questions to the web master.
1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankener, W., and Nickerson, D.A., 1994, "Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products", Genomics 25: 615-622.
2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing", Nucleic Acids Research 25: 2745-2751.
3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated sequencer traces using phred. I. Accuracy assessment", Genome Research 8: 175-185.
4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using phred. II. Error probabilities", Genome Research 8: 186-194.
5. Green, P., 1994, Phrap, unpublished. http://www.phrap.org
6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence finishing", Genome Research 8: 195-202.
7. Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., and Nickerson, D.A., 2006, "Automating sequence-based detection and genotyping of SNPs from diploid samples", Nature Genetics 38(3): 375-381.
8. Bhangale, T., Stephens, M., and Nickerson, D.A., 2006, "Automating resequencing-based detection of insertion-deletion polymorphisms" (submitted).