Last modified: May 5, 2008
Program: PolyPhred
Version: 6.15 Beta
Copyright © 2005-2008
by Deborah A. Nickerson, Scott Taylor, Natali Kolker, Jim
Sloan, Tushar Bhangale, Matthew Stephens, and Ian Robertson
University of Washington
All rights reserved.
This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.
This software is provided "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.
This version contains a new method to detect and genotype diallelic indels (activated by the -indel flag). The SNP detection method used by this version is the same as the one in PolyPhred version 5.04. The indel detection method focuses on accurately identifying the traces containing heterozygous indels. In short, it first identifies such traces and determines the length and the location of the indel in each of them; it combines the results across the traces that align at a locus to find indel sites in the sample population; and it then determines the genotypes of the individuals in the sample at each site. For every site, it computes and reports a score that reflects the strength of the evidence for an indel at that site. Similarly, it reports a genotype score for every individual (or trace) at the site, which summarizes the confidence that should be placed in the genotype call.
Based on our tests, the method identifies the indel lengths very accurately (~96% of the time). The determined indel location may, however, be a few base pairs upstream or downstream of the location a human expert might report. Therefore, a manual tagging system is provided for marking the correct locations of indels. The corrected locations and lengths will be reported in the PolyPhred output (see User-defined manual tags).
PolyPhred will mark an indel site on the consensus sequence with an 'indelSite' tag and the traces containing heterozygous indels with a 'heterozygoteIndel' tag. Remaining traces at the site will be marked with a 'homozygoteIndel' tag. The site score, genotype score, and the indel length are reported in the tags (visible in Consed). Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, it is necessary to modify the .consedrc file. Add the following lines to the .consedrc file:
consed.customConsensusTag1: indelSite
consed.tagColorCustomConsensusTag1: DarkCyan
consed.customTag1: indel
consed.tagColorCustomTag1: DarkOrange
If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.
Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations plays a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.
Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.
Identification of potential heterozygous sites is based on 1) the presence of two significant overlapping fluorescence peaks at such sites in the sequence trace, and 2) detecting a decrease of about 50% in the peak heights when the sequence trace is compared with that obtained from homozygous individuals (references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2). In addition, if double-stranded coverage of the reads is provided, the accuracy of the results is significantly increased.
PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes the programs Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).
PolyPhred identifies potential heterozygous sites by comparing traces in a sequence assembly. Phred provides the base-calls, base quality information and the peak size information, which is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs, and to derive a consensus sequence for each contig. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.
During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence (see How PolyPhred scores SNP sites). It also uses a standard sequence for comparison to identify sites that are homozygous for a minor or alternative allele. The score indicates how well the trace at the site matches the expected pattern for a SNP. After PolyPhred identifies the putative polymorphic sites, it updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using the program Consed. PolyPhred also generates a detailed output that lists the positions, genotypes and scores of the polymorphic sites in a format that can be easily parsed into a database program.
Part of the process of computing the score involves averaging certain values across the reads in the assembly. For small assemblies, the accuracy of these averages increases with the number of overlapping sequences. This in turn increases the reliability of the results. We recommend that the region of interest be covered by at least eight independent sequences, if possible.
A significant increase in the rate of true-positive SNPs can be achieved by sequencing each sample in both directions. PolyPhred combines double-stranded information to enhance the accuracy of its genotype calls. To take advantage of this feature, it is necessary to use a sensible naming convention when naming the sequence data files. The sequence file names should contain a contiguous set of characters that identify the individual source. Using the -source flag (see below), PolyPhred can then match sequences that are from the same source (see also Reducing the false-positive rate).
Many of the flags have an abbreviated form, which is shown in parentheses. Most of the flags take an argument, which is shown in ALL CAPS. For some flags, the argument is optional. In these cases, the argument is indicated in square brackets ([ ]), and the default value used if the argument is omitted is shown.
All of the flags are optional. Each description indicates the argument value or action taken if the flag is omitted.
.ace.N" for the largest number N.
POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED, MICROSATELLITE, SAMPLE and COVERAGE. To include a block, precede the block name with a plus sign (+). To exclude a block, precede the block name with a minus sign (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:
-block -SAMPLE -COVERAGE
MICROSATELLITE will not appear unless the -ms option is also given.
If omitted: normal operation
DIRECTORY must be an absolute or relative path to the directory containing edit_dir or to edit_dir itself (see Running PolyPhred).
If omitted: PolyPhred must be run either from the edit_dir directory or from one directory above edit_dir.
Accepted numbers: 0–50
If omitted: 10 bases are reported on either side of each
reported polymorphic site.
This flag specifies a subset of the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression EXPRESSION. PolyPhred uses the POSIX regex functions, so consult your system documentation for more information on supported patterns. On Linux, for instance, this can be viewed with
man 7 regex
If omitted: .+ (All sequences are analyzed.)
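The effect of EXPRESSION can be previewed before a run with a short script. The following is only a sketch: it uses Python's re module, whose syntax overlaps with POSIX regular expressions for simple patterns but is not identical, and it assumes that a name is selected when the pattern matches anywhere in it. The function name and the trace names are made up for illustration.

```python
import re


def select_traces(names, expression=".+"):
    """Return the names in which `expression` matches somewhere.

    Assumption: a match anywhere in the name selects the trace,
    unless the pattern itself is anchored with ^ or $.
    """
    pattern = re.compile(expression)
    return [name for name in names if pattern.search(name)]


# Hypothetical trace names, used only to illustrate the filtering.
names = ["NA10830_ex3_F.scf", "NA10830_ex3_R.scf", "NA12003_ex4_F.scf"]

print(select_traces(names, "ex3"))         # traces whose name contains "ex3"
print(select_traces(names, r"_R\.scf$"))   # reverse reads only
print(select_traces(names))                # default ".+" keeps every name
```

Checking the pattern this way does not replace reading the system regex documentation, since PolyPhred itself evaluates the expression with the POSIX functions.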
--help
-help
-h
- If omitted: normal operation
-idat
If omitted: only score SNPs
Default argument: 30
EXAMPLE:
polyphred -d /path/to/gene -i 20 -inav on -iscore 85 -s 10 13 -o output_file
This command will search for indels of length up to 20, resolve genotypes across the sequences using characters 10 through 13 of each trace name to identify the individual, report those sites that score at least 85, and write the indel.nav file in the directory /path/to/gene/edit_dir.
-md 8 5 4 4 4 4 defines a microsatellite as a mononucleotide repeat with at least 8 repeats, a dinucleotide repeat with at least 5 repeats, a trinucleotide repeat with at least 4 repeats, and so on. If this flag is not specified, the default used is: -md 8 8 8 8 8 8 8 8. The operation of this flag is independent of other microsatellite-related flags (such as -ms) used in SNP discovery and genotyping.
- Default argument: on
- If omitted: off
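To make the meaning of the -md thresholds concrete, the sketch below scans a sequence for tandem repeats, where the k-th threshold is the minimum repeat count for a repeat unit of length k. This is not PolyPhred's implementation; the left-to-right greedy scan, the rule that skips units that are themselves repeats of a shorter unit, and the function name are all assumptions made for illustration.

```python
def find_microsatellites(seq, min_repeats=(8, 8, 8, 8, 8, 8, 8, 8)):
    """Return (start, unit, count) tuples for tandem repeats in `seq`.

    min_repeats[k - 1] is the minimum number of repeats required for a
    unit of length k, mirroring the argument order of -md.
    `start` is a 0-based index into `seq`.
    """
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for k, threshold in enumerate(min_repeats, start=1):
            unit = seq[i:i + k]
            if len(unit) < k:
                break  # not enough sequence left for a unit of this length
            # Skip units such as "AA" or "CACA" that repeat a shorter unit.
            if any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k)):
                continue
            count = 1
            while seq[i + count * k:i + (count + 1) * k] == unit:
                count += 1
            if count >= threshold:
                best = (i, unit, count)
                break
        if best:
            hits.append(best)
            i += len(best[1]) * best[2]  # continue after the repeat
        else:
            i += 1
    return hits


# Made-up sequence: eight A's followed later by five CA units.
example = "GG" + "A" * 8 + "CT" + "CA" * 5 + "G"
print(find_microsatellites(example, (8, 5, 4, 4, 4, 4)))
print(find_microsatellites(example))  # default thresholds of 8 for every unit
```

With the thresholds 8 5 4 4 4 4 both repeats qualify, while the default of 8 for every unit length keeps only the mononucleotide run.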
polyphred.out
-score
Use this flag to specify a reference sequence for reporting of polymorphic site positions. PolyPhred will use the last sequence in the assembly whose name contains REFID. If PolyPhred finds such a sequence, it reports positions relative to both the consensus sequence and the reference sequence in the output report. Note that this flag does not use the reference sequence for comparing sites; use -refcomp for that. Also note that SNPs will not be reported at any position that the reference sequence does not cover, nor at any position where the reference sequence is a pad.
See Using a reference sequence.
- Default argument: on, using the identifier ".REF"
- If omitted: off
-ref.
See Using a reference sequence.
- Default argument: on, using the identifier ".REF"
- If omitted: off
Use this flag to activate the source genotype resolution function and set the location in the chromat file names of the source identifier, or turn the function off. The source identifier is a contiguous set of characters that uniquely identifies the source of the DNA sample. PolyPhred uses the source identifier to match sequences from the same DNA sample. See Reducing the false-positive rate.
The source identifier can be placed in the chromat file names in either of two ways. One method is to flank the identifier characters with a delimiter. Any valid file name character can serve as the delimiter. When running PolyPhred, indicate the delimiter as follows ('c' is the delimiter character):
polyphred -s /c
For example, if the chromat file names are of the form abc-SOURCEID-xyz.scf, then run PolyPhred as
polyphred -s /-
to use a dash as the delimiter character.
The second method for locating the source identifier is to place the identifier characters in a fixed location in all chromat file names. Indicate the location of the identifier characters as follows:
polyphred -s posn1 posn2
The positions are 1-based, meaning the first character in a filename is indicated by 1. For example, if all chromat file names are of the form abcSOURCExyz.scf, where SOURCE is the location of the identifier characters, from positions 4 to 9, then run PolyPhred as follows:
polyphred -s 4 9
If the function has been activated in the .polyphredrc file, it can be turned off with the 'off' argument.
- If omitted: off
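Before choosing a naming convention, it can help to check that both methods pick out the intended identifier from existing file names. The sketch below mimics the two methods in Python. It is not PolyPhred code: the function names are hypothetical, and the delimiter version assumes the identifier is the text between the first two occurrences of the delimiter.

```python
from collections import defaultdict


def source_by_delimiter(filename, delimiter):
    """Return the text between the first two occurrences of `delimiter`."""
    parts = filename.split(delimiter)
    if len(parts) < 3:
        raise ValueError(f"{filename!r} does not contain two {delimiter!r} delimiters")
    return parts[1]


def source_by_position(filename, first, last):
    """Return characters first..last of the name (1-based, inclusive)."""
    if not 1 <= first <= last <= len(filename):
        raise ValueError(f"positions {first}-{last} are outside {filename!r}")
    return filename[first - 1:last]


def group_by_source(filenames, extract):
    """Group file names by the identifier that `extract` returns for each."""
    groups = defaultdict(list)
    for name in filenames:
        groups[extract(name)].append(name)
    return dict(groups)


# The two file name forms used in the text above.
print(source_by_delimiter("abc-SOURCEID-xyz.scf", "-"))   # like: polyphred -s /-
print(source_by_position("abcSOURCExyz.scf", 4, 9))       # like: polyphred -s 4 9

# Made-up forward and reverse reads from two samples.
reads = ["abc-NA001-F.scf", "abc-NA001-R.scf", "abc-NA002-F.scf"]
print(group_by_source(reads, lambda name: source_by_delimiter(name, "-")))
```

Reads that land in the same group are the ones that would be treated as coming from the same DNA sample.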
Use this flag to specify the tagging mode to use for viewing SNP sites in Consed. The three tagging modes are "genotype", "polymorphism", and "rank". The modes can be abbreviated as g, p and r, respectively.
genotype
g
polymorphism
p
rank
r
See How PolyPhred scores SNP sites for the color codes.
- If omitted: genotype
Using the genotype tag results in putative polymorphic sites marked on the consensus sequence with color-coded tags indicating rank, and putative SNPs marked with pink tags on the sample sequences. Using the rank tag results in color-coded tags indicating rank placed on both the consensus sequence and the sample sequences (see How PolyPhred scores SNP sites for the color codes). Using the polymorphism tag results in a blue tag placed on all putative polymorphic sites on the consensus sequence and pink tags indicating putative SNPs on the sample sequences.
- Default argument: on
- If omitted: on
- If omitted: 0
- If omitted: normal operation
- Accepted numbers: 5 - 50
- If omitted: 20
- Default argument: on
- If omitted: off (normal PolyPhred output)
A SNP site generally appears in the sequence traces as two overlapping peaks with reduced peak heights. Ideally, the areas under these two peaks are nearly the same, and the heights of the peaks are reduced by about a half of what the height of a hypothetical homozygous peak would be at the same position.
When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a score that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. The score values range from 99 to 0, with 99 indicating a very good fit.
If a site is determined to be homozygous, PolyPhred compares its genotype with that of a standard sequence, which can be either the consensus sequence or a user-specified reference sequence. If the genotypes do not match, the site is marked as a minor or alternative allele.
If the -source flag is used, PolyPhred combines the information in matched reads to increase the accuracy of its genotype calls. Scores for genotypes that are in agreement are increased (see Reducing the false-positive rate).
When all sites at a position (i.e., a column as viewed in Consed) have been assigned a score, PolyPhred calculates an overall score and genotype for the position. This score depends on the highest-scoring site in the sample sequences. If the overall score is greater than or equal to the score threshold (see the -score flag), then PolyPhred marks the position as polymorphic. The number of sites that PolyPhred marks can be controlled by adjusting the score threshold (see Reducing the false-positive rate).
If the six-point ranking system is selected, PolyPhred converts the score to a rank according to the table below. Along with each rank is the color of the tags as displayed in Consed.
The 'True Positive Rate' column shows the percentage of true positive SNPs marked within each rank, as found in our own analysis, using the default -score and -quality settings. These results may vary depending on changes in these settings, as well as the quality of the data and number of samples analyzed.
Score | Rank | Tag Color | True Positive Rate |
99 | 1 | red | 97% |
95–98 | 2 | orange | 75% |
90–94 | 3 | green | 62% |
70–89 | 4 | dark blue | 35% |
50–69 | 5 | magenta | 11% |
0–49 | 6 | purple | 1% |
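When post-processing an output report, the score-to-rank table above can be written as a small lookup. The sketch below simply transcribes the score ranges, ranks, and tag colors from the table; the function name is hypothetical and the code is not taken from PolyPhred.

```python
# (minimum score, rank, tag color), highest range first, from the table above.
RANK_TABLE = [
    (99, 1, "red"),
    (95, 2, "orange"),
    (90, 3, "green"),
    (70, 4, "dark blue"),
    (50, 5, "magenta"),
    (0, 6, "purple"),
]


def score_to_rank(score):
    """Return (rank, tag color) for a PolyPhred score between 0 and 99."""
    if not 0 <= score <= 99:
        raise ValueError(f"score must be between 0 and 99, got {score}")
    for minimum, rank, color in RANK_TABLE:
        if score >= minimum:
            return rank, color


for score in (99, 96, 90, 75, 50, 12):
    print(score, score_to_rank(score))
```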
To facilitate parsing of the output file, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.
The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.
Next is the COMMAND_LINE block. Listed in this block are the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. This includes the working directory and the ACE file that was used, and those parameters that affect the analysis.
The rest of the report contains results for one or more contigs. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks appear in the output report by using the -block flag.
If the -ref flag is used, PolyPhred adds an additional field in the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. The extra field, which comes second after the consensus sequence position, is the position relative to a reference sequence.
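The BEGIN_/END_ structure described above lends itself to a simple stack-based reader. The following sketch collects the lines found directly inside each block, keyed by the chain of enclosing block names. It assumes that every BEGIN_ and END_ token sits alone on its own line; the sample text contains invented placeholder lines, not real PolyPhred output, and the function name is hypothetical.

```python
def parse_blocks(text):
    """Return a list of (path, lines) pairs, one per closed block.

    `path` is a tuple of block names from the outermost block inward;
    `lines` holds the non-empty lines directly inside that block.
    """
    blocks = []   # completed blocks, in the order their END_ token is seen
    stack = []    # currently open blocks as [name, lines]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("BEGIN_"):
            stack.append([line[len("BEGIN_"):], []])
        elif line.startswith("END_"):
            name = line[len("END_"):]
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unexpected END_{name}")
            open_name, lines = stack.pop()
            path = tuple(entry[0] for entry in stack) + (open_name,)
            blocks.append((path, lines))
        elif stack:
            stack[-1][1].append(line)
    if stack:
        raise ValueError(f"block {stack[-1][0]} was never closed")
    return blocks


# Invented placeholder report; real reports contain data lines instead.
sample = """\
BEGIN_MESSAGE
BEGIN_HEADER
(placeholder header line)
END_HEADER
BEGIN_CONTIG
Contig1
BEGIN_POLY
(placeholder site line)
END_POLY
END_CONTIG
END_MESSAGE
"""

for path, lines in parse_blocks(sample):
    print("/".join(path), lines)
```

The individual data lines can then be split into fields according to the block descriptions that follow, remembering that the -ref flag inserts an extra field after the consensus position.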
POLY block
In this block, the putative SNP sites identified by PolyPhred are listed, as well as sites marked by columntag type tags (see User-defined manual tags). Each line reports the consensus sequence position, the 5' sequence flanking the polymorphic site, the two most common alleles at the site, the 3' sequence flanking the site, and the overall score assigned to the site.
- XML tag: block-snp_site subtag: snp_site
GENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each putative SNP site listed in the POLY block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position (in alphabetical order), and the score.
If the -ref or -refcomp flags were supplied, the reference sequence position appears after the consensus position.
If the -extended_genotype flag was passed to PolyPhred, two additional columns are printed indicating the direction of the read and the coordinate of the primary peak as determined by Phred. See the flags section for more information.
- XML tag: block-snp_genotype subtag: snp_genotype
COLUMNGENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each manual-SNP tag applied to the consensus sequence. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
- XML tag: block-manual_snp subtag: snp_genotype
COLUMNINDEL block
In this block, the genotypes of the individual sample sequences are listed for each manual-indel tag. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, and the genotype. The tag used to specify the genotype can be user-defined in the .polyphredrc file (see User-defined manual tags).
- XML tag: block-manual_indel subtag: manual_indel
MANUALGENOTYPE block
In this block, sample sequence sites that have been tagged manually are listed. Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, the identity of the tag, and the comment if one is present. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
- XML tag: block-manual_genotype subtag: manual_genotype
VERIFIED block
In this block, sites manually tagged as verified are listed. Each line reports the consensus sequence position and the tag identity. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
- XML tag: block-verified_site subtag: verified_site
MICROSATELLITE block
If the -ms flag is set to 'on', this block lists the microsatellite sequences that were found. Each line reports the consensus sequence position of the 5' end of the microsatellite and the repeat pattern.
- XML tag: block-microsatellite subtag: microsatellite
SAMPLE block
The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the positions of the left and right boundaries of the search region (between the trimmed ends), and the average site quality, as determined by Phred, within the search region.
- XML tag: block-sample_quality subtag: sample_quality
COVERAGE block
This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
- XML tag: block-coverage subtag: coverage
Running PolyPhred with the -i flag adds two blocks to the output report: the INDELPOLY block, which contains information about indel sites, and the INDELGENOTYPE block, which contains information about the genotypes at these sites.
INDELPOLY block
This block reports information about putative indel sites. The columns are as follows:
INDELGENOTYPE block
This block reports genotype calls for sites listed in the INDELPOLY block. The columns are as follows:
++ if the read is homozygous for the long allele
+- if the read is heterozygous
-- if the read is homozygous for the short allele
If the -idat flag is used, the above two blocks report additional information:
INDELPOLY block
Two more columns are added to the original 5 columns. The 6th column reports the log-likelihood-ratio score for the site and the 7th column reports the location of a microsatellite found upstream of the site (-1 if no microsatellite is found).
INDELGENOTYPE block
Two columns are inserted between the 5th and the 6th columns: the first reports the log-likelihood-ratio score for the genotype and the second contains the location of a microsatellite found upstream of the site (-1 if no microsatellite is found).
One of the features available in the Consed program is the ability to create custom tags. These tags can be used to mark or highlight specific sites or regions on the consensus sequence or on individual sample sequences. For example, following analysis by PolyPhred, the user can manually mark putative SNP sites as verified, or change an incorrect genotype. To create custom tags, the user needs to define the tags in the .consedrc file (see the Consed documentation under the Help menu).
PolyPhred can be set to recognize four types of custom tags, and take an appropriate action when they are encountered. This provides a way for the user to pass information from Consed to the PolyPhred output file. For example, PolyPhred can be set to recognize a custom "verified" tag and report sites marked with this tag type in the VERIFIED block of the output file. In addition, two of the custom tag types, columntag and columnindeltag, can be used to force PolyPhred to report genotypes for all sample sequences at the specified positions.
For PolyPhred to recognize the tags, they must be listed in the .polyphredrc file (see Customizing PolyPhred). Once the .polyphredrc file has been set up, the typical procedure is to 1) assemble the data, 2) run PolyPhred, 3) use Consed to analyze the results, mark sites and make changes, and 4) run PolyPhred again to obtain both the PolyPhred- and user-generated information in the output file.
The tag types are as follows:
Tags of this type are used to mark or edit a site in a sample sequence. Typically these tags are used to change the genotype call made by Phred or PolyPhred. Sites marked with these tags are listed in the MANUALGENOTYPE block.
This tag type is applied to the consensus sequence to indicate that a polymorphic site is verified. Sites marked with these tags are listed in the VERIFIED block.
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide SNP genotypes for all of the sample sequences at the tagged sites. Sites marked by these tags are listed in the POLY block, and the genotypes in the sample sequences are listed in the COLUMNGENOTYPE block.
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide indel genotypes for all of the sample sequences at the tagged sites. The tags can be used to mark the positions and define the length of indel sites. The tag should "cover" the segment involved in the indel so that PolyPhred can report the indel segment in the output. Sites marked by these tags are listed in the POLYINDEL block, and the genotypes in the sample sequences are listed in the COLUMNINDEL block. The name of the tag that marks the site will be used to indicate the homozygous genotype. The heterozygous genotype can be set in the .polyphredrc file with the 'indelhettag' key-word. If this is not set, PolyPhred will indicate heterozygotes with the label 'heterozygoteIndel'.
This tag is added to the consensus sequence. Additional information included in the tag is: the consensus location of the indel, the score of the site, and the length of the indel.
This tag is added to the heterozygous genotypes at the site. Additional information included with this tag is: the consensus position, genotype, genotype score and the length of the indel in the read.
This tag is added to the homozygous genotypes at the site.
phred version 0.961028 or later
phrap version 0.960731 or later
phd2fasta version 0.971024 or later
consed version 13.0 or later
polyphred.tar.gz with the exact name of the file you downloaded. This should produce the following files and directories:
polyphred-VERSION-binary-HOST/
    bin/
        polyphred     the PolyPhred program
        polygen       tool for making PHD and POLY files from ABI chromat files
        sudophred     tool for making chromat, PHD and POLY files from FASTA files
        phredPhrap    perl script for running phred and phrap together in the correct order
    doc/
        polyphred.html    this document
/usr/local/bin
cd polyphred-version-binary-host/bin
cp -vi polyphred polygen sudophred phredPhrap /usr/local/bin
if ( $bUsingPolyPhred ) {
    print "\n\n--------------------------------------------------------\n";
    print "Now running polyphred for polymorphism detection...\n";
    print "--------------------------------------------------------\n\n\n";
    $szPolyPhredFile = $szBaseName . ".polyphred.out";
    $szPolyPhredFile = $szBaseName . ".fasta.screen.polyphred.out";
    !system( "$polyPhredExe -ace $szAceFileToBeProduced > $szPolyPhredFile" )
        || die "some problem running $polyPhredExe $!";
}
Read the section Customizing PolyPhred for instructions on customizing Consed.
PolyPhred reads and modifies data files that are generated by the programs Phred and Phrap, and that can be examined by the program Consed. These programs require the sequence data files to be located in a 'work directory' containing three subdirectories called 'chromat_dir', 'phd_dir' and 'edit_dir'. In addition, PolyPhred needs a fourth subdirectory called 'poly_dir'. It is recommended that a separate working directory be created for each data set. For example, if the data set is called "mydata", a directory called mydata can be created:
Within this directory, create the four subdirectories as follows:
After these directories have been created, move or copy the chromat files to the chromat_dir directory.
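For those who script their setup, the work directory layout described above (chromat_dir, phd_dir, edit_dir and poly_dir under one data set directory) can be created with a few lines of Python. This is a convenience sketch only; the function name is hypothetical, and the demonstration writes to a temporary location so that nothing is left behind.

```python
import tempfile
from pathlib import Path

# The four subdirectories named in the text above.
SUBDIRS = ("chromat_dir", "phd_dir", "edit_dir", "poly_dir")


def make_work_directory(root):
    """Create `root` and the four subdirectories; existing ones are kept."""
    root = Path(root)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


# Demonstration in a temporary location; use a real path such as "mydata" in practice.
with tempfile.TemporaryDirectory() as tmp:
    work = make_work_directory(Path(tmp) / "mydata")
    print(sorted(path.name for path in work.iterdir()))
```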
If a reference sequence is to be included in the assembly, use the sudophred tool to generate fake chromat, PHD and POLY files.
To assemble the data, cd to the edit_dir directory and run
The phredPhrap script automatically runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir directory, including one with the extension .ace.1 (the ACE file), and several files in the directories phd_dir and poly_dir.
View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.
Now run PolyPhred. Include any desired flags on the command line. For example:
The output report can be viewed in a pager or text editor. Use Consed to view or edit the tags PolyPhred has placed on the assembly (see Customizing PolyPhred for more information):
There are three ways to affect the rate of false-positive calls made by PolyPhred. The best method is to use the source genotype resolution function (the -source flag). This method achieves a large reduction in false positives while minimizing the loss of true sites. To use this feature, there should be double-stranded coverage (sequencing in both directions) for most or all of the samples. The sequence file (chromat) name should contain a string of contiguous characters that uniquely identifies the sample. The location of the identifier is then passed to PolyPhred using the -source flag. PolyPhred can then match reads from the same source. When the genotype calls for two matched reads are in agreement, the resulting score is increased. If the genotypes disagree, PolyPhred chooses the genotype with the greater likelihood of being correct.
The most direct method is to use the -score flag to set the score threshold. Only sites that receive a score at or above this threshold are called, so increasing the threshold results in fewer calls.
For those using the six-point ranking system, increasing the rank threshold means setting this value to 2 or 1. This will have the same effect as increasing the score threshold to 95 or 99, respectively.
In general, the false-positive SNP call rate tends to increase near the trimmed regions at the ends of a sequence. Therefore, trimming more of the ends will tend to reduce the number of false-positive calls. The length of the trimming is increased by raising the quality threshold, which is set with the -quality flag.
For all of these methods, reducing the number of false-positive calls will also result in an increase in the number of real SNPs that are missed (false negatives). Generally, as one reduces the false-positive rate, the number of false positives that are eliminated is much greater than the number of missed real SNPs. Also, the first real sites that are missed are the rare SNPs, that is, sites with only one or two heterozygotes present in the data set.
The polygen program can be used to create PHD and POLY files using the base calls and quality scores generated by the ABI base-calling software. This method is an alternative to using the Phred base-calling program.
Polygen can be run from either the edit_dir directory or the directory above it (the work directory). To run the program, enter:
It can also be run from any other directory by using the -dir (or -d) flag to specify the work directory where the data is located, similar to the -dir flag for PolyPhred. For example:
The program looks in the chromat_dir directory for the chromat files. It creates a PHD and POLY file for each chromat file that lacks a PHD file. The PHD files are written into the phd_dir directory, and the POLY files are written into the poly_dir directory.
Alternatively, the -list (-l) flag can be used to specify a file containing a list of chromat files. Polygen creates PHD and POLY files from these chromat files instead. Each line in the file should be the name of a file in the chromat_dir, or a path relative to the chromat_dir. If the filename is a single dash, the list is read from standard input instead.
To force polygen to overwrite any existing PHD and POLY files, use the -overwrite (-o) flag.
Run polygen -h or polygen --help to show a list of the options.
Run polygen -v or polygen --version to show the version.
The sudophred program is a tool that can be used to generate fake chromat, PHD and POLY files from DNA sequences in FASTA format. Fake chromat and PHD files are needed if one wishes to include a reference sequence in the assembly of the data set (see Using a reference sequence). Also, if one wants to compare data from sequence trace (chromat) files with text sequences, the text sequences need to be converted into all three file types.
The sudophred program takes one text file as input. The text
file can contain one or more sequences in FASTA format. If one is
generating fake data files for a reference sequence, sudophred
writes data files for the first sequence only. Otherwise,
sudophred will generate data files for each of the sequences in
the text file. In either case, the names of the data files are
taken from the string that follows the greater-than symbol (>) at the
beginning of each sequence.
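The naming rule above can be previewed before running sudophred. The sketch below is only an illustration: the function `fasta_names` is hypothetical, and it assumes that the name is the first whitespace-delimited token after the '>' symbol, which the text above does not state explicitly.

```python
def fasta_names(fasta_text, reference_only=False):
    """List the sequence names found on the '>' header lines of FASTA text.

    Assumption: the name is the first token after '>'.
    With reference_only=True only the first sequence is reported, mirroring
    the reference-sequence behavior described above.
    """
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.append(header.split()[0])
    return names[:1] if reference_only else names
```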
One way to run sudophred is to put the FASTA file into the edit_dir
directory. Sudophred will create each file that it generates in the
appropriate directory. That is, the chromat file will be created in the
chromat_dir directory, the phd file in phd_dir, and the poly file in
poly_dir. One can also put the FASTA file in an arbitrary directory and run
sudophred from there. In this case, or if the normal directories cannot be
found, sudophred will write all of the files into that same directory. The
files must then be moved to the appropriate data subdirectories. In either
case, it is easiest to generate the fake data files before running the
phredPhrap program that assembles the data into contigs.
By default, sudophred writes all three files. The chromat files are written in SCF format. In the phd files, all quality values are set to 59.
To run sudophred, enter:
where filename is the name of the text file containing the sequences. The file name must always be the first argument.
To use sudophred to generate files for a reference sequence, use the -r
flag. This flag can be followed by a string that PolyPhred will use to
identify the reference sequence. For example,
will instruct sudophred to create a sequence whose name ends with ‘.XYZ’.
If no string is supplied, sudophred will use the default string ‘.REF’.
To change the quality threshold, use the -q flag followed by the value (an integer from 0 to 59). For example:
To write the chromat files in ABI format, use the -abi flag:
Run "sudophred -h" or "sudophred -help" to show a list of the options.
Run "sudophred -v" or "sudophred -version" to show the version.
For the purpose of locating SNPs and other features on a standard sequence map, it is useful to include the standard, or reference, sequence in the data assembly. One can then run PolyPhred with the -ref flag to obtain the SNP positions relative to that reference sequence. Furthermore, one might want to have PolyPhred compare the sample sequences with the reference sequence rather than with the consensus sequence that is generated by Phrap. This can be done by running PolyPhred with the -refcomp flag.
When either the -ref or -refcomp flag is used, PolyPhred reports two positions in the output file rather than one. The blocks that use this alternate format are the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. In each block, the first number is the position of the feature relative to the consensus sequence, and the second is the position relative to the reference sequence.
To include a reference sequence in the assembly, one should first create the necessary data files from the reference sequence. These files can be generated with the sudophred program supplied with PolyPhred (see Using the sudophred tool).
Use sudophred with the -r flag to generate the reference sequence data files. For example,
where filename is the name of the text file containing the reference sequence in FASTA format. The data files will be given names that begin with the string that follows the '>' at the beginning of the sequence, followed by the default reference identifier ".REF". In this case, one would run PolyPhred with the reference options as follows:
To specify a different reference identifier, follow the -r flag with the identifier string. For example, to set the reference identifier as ‘xYZ’, run:
PolyPhred can be customized to suit the preferences of the user by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as to specify any manual tags that PolyPhred should capture and write to the output report. This file is optional. If it is not present, PolyPhred will use its built-in default parameter values and will not capture manual tags.
When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks for a directory in the user's shell rc file. The directory is specified by including in the shell rc file the line:
where [path] is the directory containing the .polyphredrc file.
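The search order described above can be sketched in Python as follows. This is an illustration only, not PolyPhred's own code: the function `find_polyphredrc` is hypothetical, and the third directory is passed in as a plain argument rather than read from the shell rc file.

```python
import os


def find_polyphredrc(current_dir, home_dir, extra_dir=None, name=".polyphredrc"):
    """Return the first existing rc file in the documented search order.

    Order: current directory, then home directory, then the extra directory
    (the one configured through the shell rc file). Returns None when no file
    is found, in which case the built-in defaults would apply.
    """
    for directory in (current_dir, home_dir, extra_dir):
        if directory:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                return candidate
    return None
```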
Each line in the .polyphredrc file can be a blank line, a comment line beginning with a '#' character, or a line beginning with one of the following key-words:
The 'flag' key-word can be used with any of the command-line flags to change a default value. For example, to change the default score threshold to 80 and the quality threshold to 30, enter these lines in the .polyphredrc file:
The following line
changes two defaults; it will set the name of the output file to 'out.txt' and cause PolyPhred to write the output in a file with that name rather than to the screen. To change the default file name but keep output to the screen as the default activity, use the 'outputfile' key-word, as:
Then, to use the new default output file name, run ‘polyphred -o on’.
Similarly, both lines below change the default name of the navigation file, but the first line causes PolyPhred to write a navigation file by default, while the second line leaves the default activity off.
flag -nav [file name]
navfile [file name]
All three lines below change the default reference sequence identifier. The first two lines turn on ref and refcomp modes, respectively, while the third line does not affect the reference mode. Note that sequences containing the reference identifier are excluded from regular processing, even when ref and refcomp modes are disabled.
The 'ranks' key-word allows the user to change the values used to convert scores to ranks (see How PolyPhred scores SNP sites). For example, the following line:
will result in these conversions:
Probability | Rank |
99-90 | 1 |
89-80 | 2 |
79-60 | 3 |
59-40 | 4 |
39-20 | 5 |
19-0 | 6 |
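The conversion in the table above amounts to comparing a score against a descending list of lower bounds. The following Python sketch reproduces that table. The function `score_to_rank` is hypothetical, and the boundary values (90, 80, 60, 40, 20) are read off the table rather than taken from the 'ranks' line itself.

```python
def score_to_rank(score, lower_bounds=(90, 80, 60, 40, 20)):
    """Map a score (0-99) to a rank from 1 (best) to 6 (worst).

    lower_bounds holds the smallest score that still earns ranks 1 to 5;
    anything below the last bound receives rank 6.
    """
    for rank, bound in enumerate(lower_bounds, start=1):
        if score >= bound:
            return rank
    return len(lower_bounds) + 1
```

For instance, a score of 85 falls in the 89-80 band and is given rank 2.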
The 'acedir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'acedir' key-word sets the location of the ace file (which is normally in the edit_dir directory). The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. A directory is considered to be within the work directory, unless an absolute path starting with '/' is given. Use a '.' to indicate that a directory is the same as the work directory.
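The path rules for these key-words can be summarized in a few lines of Python. This is a sketch of the rules as stated above, not PolyPhred's implementation. The function `resolve_data_dir` and the example paths are hypothetical.

```python
import posixpath


def resolve_data_dir(work_dir, setting):
    """Resolve a directory setting against the work directory.

    '.' means the work directory itself, a value starting with '/' is used
    as an absolute path, and anything else is taken as a subdirectory of the
    work directory.
    """
    if setting == ".":
        return work_dir
    if setting.startswith("/"):
        return setting
    return posixpath.join(work_dir, setting)
```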
The 'date' key-word allows the user to set the format of the date that appears at the top of the output file. The key-word must be followed by one of six format codes:
Format code | Example |
DMY | 31/12/07 |
MDY | 12/31/07 |
YMD | 07/12/31 |
DMYY | 31/12/2007 |
MDYY | 12/31/2007 |
YYMD | 2007/12/31 |
The default is the DMY format.
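For readers who post-process the report, the six format codes correspond to the following strftime patterns. The mapping below is inferred from the examples in the table. The names `DATE_FORMATS` and `format_report_date` are hypothetical.

```python
from datetime import date

# strftime patterns inferred from the example column of the table above.
DATE_FORMATS = {
    "DMY": "%d/%m/%y",
    "MDY": "%m/%d/%y",
    "YMD": "%y/%m/%d",
    "DMYY": "%d/%m/%Y",
    "MDYY": "%m/%d/%Y",
    "YYMD": "%Y/%m/%d",
}


def format_report_date(day, code="DMY"):
    """Format a date using one of the six documented format codes."""
    return day.strftime(DATE_FORMATS[code])
```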
Four of the key-words set tag names for the four tag types (see
User-defined manual tags). Each tag type can
have more than one name (see the example .polyphredrc file
below). In addition, the indelhettag key-word allows the user to
specify the tag that will be used to indicate heterozygous
indels.
Here is an example of a .polyphredrc file:
# PolyPhred configuration file
# Set the date format to YYYY-MM-DD
date YYMD
flag -q 25    # Quality threshold
flag -f 16    # Flanking length
# Send output to edit_dir/report.txt
outputfile report.txt
# Treat files containing `.refSeq' as reference sequences
refID .refSeq
# Manual tags to read from ACE file and include in output report
verifiedtag polymorphism
columntag manualGenotype
columnindeltag indel:++
columnindeltag indel:--
indelhettag indel:+-
manualtag heterozygote
manualtag homozygote
manualtag indel
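The line structure of a .polyphredrc file (blank lines, '#' comments, and key-word lines) is simple enough to check with a short script before running PolyPhred. The sketch below is a hypothetical helper, not part of the distribution. It assumes that everything after a '#' on a line is a comment and that the key-word is separated from its value by whitespace.

```python
def parse_polyphredrc(text):
    """Return (key-word, value) pairs from the text of a .polyphredrc file.

    Assumptions: '#' starts a comment anywhere on a line, and the first
    whitespace-delimited token of a line is the key-word.
    """
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        entries.append((keyword, value))
    return entries
```

Because a key-word such as manualtag may appear more than once, the result is kept as an ordered list of pairs rather than a dictionary.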
As with PolyPhred, the flag key-word can be used to set any of polygen's flags.
The 'chromatdir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'chromatdir' key-word sets the location of the chromat files. The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. Each directory path given is assumed to be relative to the work directory, unless an absolute path (one that starts with a '/') is given. Use a '.' to indicate that a directory is the same as the work directory.
If you have questions or problems with Phred, Phrap or Consed,
or you need to obtain these programs, please see the web site
at:
http://www.phrap.org
If you have questions, comments, or bug reports regarding PolyPhred, please
email polyphred at u dot washington dot edu. Be as specific as possible. You
should indicate which platform and version of PolyPhred you are using, the
steps to reproduce the problem, and what behavior you expected.
Please do not email questions to the webmaster.
1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A., 1994, "Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products", Genomics 25: 615-622.
2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing", Nucleic Acids Research 25: 2745-2751.
3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated sequencer traces using phred. I. Accuracy assessment", Genome Research 8: 175-185.
4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using phred. II. Error probabilities", Genome Research 8: 186-194.
5. Green, P., 1994, Phrap, unpublished. http://www.phrap.org
6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence finishing", Genome Research 8: 195-202.
7. Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., and Nickerson, D.A., 2006, "Automating sequence-based detection and genotyping of SNPs from diploid samples", Nature Genetics 38(3): 375-381.
8. Bhangale, T., Stephens, M., and Nickerson, D.A., 2006, "Automating resequencing-based detection of insertion-deletion polymorphisms" (submitted).