Last modified: May 5, 2008

Program: PolyPhred
Version: 6.15 Beta
Copyright © 2005-2008
by Deborah A. Nickerson, Scott Taylor, Natali Kolker, Jim Sloan, Tushar Bhangale, Matthew Stephens, and Ian Robertson
University of Washington

All rights reserved.

This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.

This software is provided "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

Contents

  1. Description of Features
  2. Setup and Operating Instructions
  3. More Information

Note to PolyPhred version 6.0 Beta Users

This version contains a new method to detect and genotype diallelic indels (activated by the -indel flag). The SNP detection method used by this version is the same as the one in PolyPhred version 5.04.

The indel detection method focuses on accurately identifying the traces containing heterozygous indels. In short, it first identifies such traces and determines the length and the location of the indel in each of them; it then combines the results across the traces that align at a locus to find indel sites in the sample population; and finally it determines the genotypes of the individuals in the sample at each site. For every site, it computes and reports a score that reflects the strength of the evidence for an indel at that site. Similarly, it reports a genotype score for every individual (or trace) at the site, which summarizes the confidence that should be placed in the genotype call.

Based on our tests, the method identifies the indel lengths very accurately (~96% of the time). The determined indel location may, however, be a few base pairs upstream or downstream of the location a human expert might report. Therefore, a manual tagging system is provided for marking the correct locations of indels. The corrected locations and lengths will be reported in the PolyPhred output (see User-defined manual tags).

PolyPhred will mark an indel site on the consensus sequence with an 'indelSite' tag and the traces containing heterozygous indels with a 'heterozygoteIndel' tag. Remaining traces at the site will be marked with a 'homozygoteIndel' tag. The site score, genotype score, and the indel length are reported in the tags (visible in Consed). Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, add the following lines to the .consedrc file:

consed.customConsensusTag1: indelSite
consed.tagColorCustomConsensusTag1: DarkCyan
consed.customTag1: indel
consed.tagColorCustomTag1: DarkOrange

If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.


Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations plays a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.

Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.

Identification of potential heterozygous sites is based on 1) the presence of two significant overlapping fluorescence peaks at such sites in the sequence trace, and 2) detecting a decrease of about 50% in the peak heights when the sequence trace is compared with that obtained from homozygous individuals (references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2). In addition, if double-stranded coverage of the reads is provided, the accuracy of the results is significantly increased.

PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes the programs Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).


How PolyPhred works

PolyPhred identifies potential heterozygous sites by comparing traces in a sequence assembly. Phred provides the base-calls, base quality information and the peak size information, which is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs, and to derive a consensus sequence for each contig. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.

During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence (see How PolyPhred scores SNP sites). It also uses a standard sequence for comparison to identify sites that are homozygous for a minor or alternative allele. The score indicates how well the trace at the site matches the expected pattern for a SNP. After PolyPhred identifies the putative polymorphic sites, it updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using the program Consed. PolyPhred also generates a detailed output that lists the positions, genotypes and scores of the polymorphic sites in a format that can be easily parsed into a database program.

Part of the process of computing the score involves averaging certain values across the reads in the assembly. For small assemblies, the accuracy of these averages increases with the number of overlapping sequences. This in turn increases the reliability of the results. We recommend that the region of interest be covered by at least eight independent sequences, if possible.

A significant increase in the rate of true-positive SNPs can be achieved by sequencing each sample in both directions. PolyPhred combines double-stranded information to enhance the accuracy of its genotype calls. To take advantage of this feature, it is necessary to use a consistent naming convention for the sequence data files. The sequence file names should contain a contiguous set of characters that identifies the individual source. Using the -source flag (see below), PolyPhred can then match sequences that are from the same source (see also Reducing the false-positive rate).


The Flags

Many of the flags have an abbreviated form, which is shown alongside the full form. Most of the flags take an argument, which is shown in ALL CAPS. For some flags, the argument is optional. In these cases, the argument is indicated in square brackets ([ ]), and the default value used if the argument is omitted is shown.

All of the flags are optional. Each description indicates the argument value or action taken if the flag is omitted.

-ace FILE
-a FILE
Use this flag to specify the ACE file to be read by PolyPhred.
If omitted: the most recent ACE file is used. (Specifically, the file whose name contains ".ace.N" for the largest number N.)
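
The default selection rule can be pictured with a short Python sketch. This is only an illustration of "largest N in .ace.N", not PolyPhred's own code; the function name latest_ace and the file names are hypothetical.

```python
import re

def latest_ace(filenames):
    """Return the name containing '.ace.N' with the largest N, or None.

    Illustrative only: N is compared numerically, so .ace.10 beats .ace.2.
    """
    best_name, best_n = None, -1
    for name in filenames:
        match = re.search(r"\.ace\.(\d+)", name)
        if match and int(match.group(1)) > best_n:
            best_name, best_n = name, int(match.group(1))
    return best_name

# Hypothetical directory listing.
names = ["gene.fasta.screen.ace.1", "gene.fasta.screen.ace.2",
         "gene.fasta.screen.ace.10", "notes.txt"]
print(latest_ace(names))  # prints gene.fasta.screen.ace.10
```

To analyze an older assembly instead, name the file explicitly with -ace.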
-block +BLOCK-NAME [ +BLOCK-NAME ... ]
-block -BLOCK-NAME [ -BLOCK-NAME ... ]
Use this flag to include or exclude blocks from the output file. The valid block names are POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED, MICROSATELLITE, SAMPLE and COVERAGE. To include a block, precede the block name with a plus sign (+). To exclude a block, precede the block name with a minus sign (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:
-block -SAMPLE -COVERAGE
If omitted: all blocks are included in the output. MICROSATELLITE will not appear unless the -ms option is also given.
-clear
Use this flag to remove all PolyPhred tags from the ACE and PHD files.

If omitted: normal operation

-dir DIRECTORY
-d DIRECTORY
Use this flag to specify the location of the data. This flag allows PolyPhred to be run from a directory other than the one containing the data to be analyzed. DIRECTORY must be an absolute or relative path to the directory containing edit_dir or to edit_dir itself. (see Running PolyPhred).

If omitted: PolyPhred must be run either from the edit_dir directory or from one directory above edit_dir.

-extended_genotype
Adding this flag causes PolyPhred to append two extra columns to the data in the GENOTYPE block. The first contains a C or a U indicating the complementedness of the read. Forward reads are indicated by 'U', and reverse reads are indicated by 'C'. The second additional column contains the coordinate of the primary peak as indicated by the .poly file associated with the read.
-flanking 0–50
-f 0–50
Use this flag to specify the number of bases flanking the polymorphic sites reported in the POLY and POLYINDEL blocks of the PolyPhred output.

Accepted numbers: 0–50
If omitted: 10 bases are reported on either side of each reported polymorphic site.

-group EXPRESSION
-g EXPRESSION

This flag specifies a subset of the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression EXPRESSION. PolyPhred uses the POSIX regex functions, so consult your system documentation for more information on supported patterns. On Linux, for instance, this can be viewed with
man 7 regex

If omitted:  .+ (All sequences are analyzed.)

--help
-help
-h
Use this flag to see information on how to use PolyPhred. The flags are listed along with their allowed and default values.

-  If omitted: normal operation

-idat
This is an optional flag. If included, the method writes the results of the trace-by-trace analysis by the indel algorithm (see the -indel flag, below) to the standard output for debugging and for computing the detection accuracy. To avoid mixing this output with PolyPhred's output report, use the -o flag to specify a filename for PolyPhred's output report.
-indel [ INDEL-LENGTH ]
-i [ INDEL-LENGTH ]
This flag instructs PolyPhred to run the indel detection algorithm. The flag can optionally be followed by an integer from 1 to 30, which specifies the maximum indel length the method searches for. For example, to search for indels of length up to 15, use: -i 15. The computational time is proportional to the value of this integer; the method requires approximately 4 hours to analyze a 30 kb gene sequenced across 47 individuals when searching for indels of length up to 30. In our datasets, ~85% of indels are shorter than 5 bp, ~95% are shorter than 13 bp, and ~99% are shorter than 31 bp. The computational time can therefore be shortened considerably by supplying a smaller indel length to this flag. The method uses the basecalls of the reads containing heterozygous indels in the data. Therefore, if the basecalls have been removed manually or otherwise (e.g., converted to 'N') in order to prevent PolyPhred from reporting sites in these reads as SNPs, the indel detection method will not function properly. The -s flag, which activates the source genotype resolution function, is also applicable and recommended for this algorithm. By default, the method does not report indels that follow repeat tracts with 8 or more repeats (as these can be indel errors introduced during PCR amplification). This default can, however, be changed using the -md flag (see below). The following flags are functional only if the -i flag is used: -inav, -iscore, -md, -idat.

If omitted: only score SNPs
Default argument: 30

-inav [ on / off / FILENAME ]
This is an optional flag that can be used to create a "navigation" file. Using a navigation file is a convenient and quick way to confirm the indels identified by the method when using Consed. We highly recommend using the navigation file to browse and confirm the indels found by the method. The flag can be followed by on, off, or a filename. If no filename is specified (e.g., -inav on), it creates a file called indel.nav in the edit_dir of the gene. To use this file, from the main window of Consed select Navigate -> Custom navigate -> filename. This opens a popup window. Each entry in the window corresponds to an indel site and displays the contig name, the name of the highest-scoring heterozygous read at the site, the consensus position of the indel, the length of the indel, the score assigned to the heterozygous genotype of that read, and the score assigned to the indel site. Double-clicking on an entry in this list will focus the cursor on the "best" heterozygous indel read at that site in the Aligned Reads Window of Consed. The corresponding trace can then be visualized by middle-clicking on the read.
-iscore NUMBER
This is an optional flag that has to be followed by an integer value (from 0 to 99). This value specifies the score cutoff for reporting indel sites. For example, -iscore 80 will only report sites that have a score of at least 80. If the flag is not specified the default value for cutoff used is 80.

EXAMPLE:

polyphred -d /path/to/gene -i 20 -inav on -iscore 85 -s 10 13 -o output_file

This command will search for indels of length up to 20, resolve genotypes across the sequences using characters 10 through 13 of each trace name to identify the individual, report those sites that score at least 85, and write the indel.nav file in the directory /path/to/gene/edit_dir.

-md N N N N N ...
This optional flag can be used to specify a definition of a microsatellite. Indels that occur downstream of these will not be reported. To define a microsatellite using this flag, specify a sequence of up to 8 integers. Each value is the minimum number of repeats of a repeat unit whose length equals the index of that integer in the sequence. For example, -md 8 5 4 4 4 4 defines a microsatellite as a mononucleotide with at least 8 repeats, a dinucleotide with at least 5 repeats, a trinucleotide with at least 4 repeats, and so on. If this flag is not specified, the default used is: -md 8 8 8 8 8 8 8 8. The operation of this flag is independent of other microsatellite-related flags (such as -ms) used in SNP discovery and genotyping.
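
The way the -md values map to repeat units can be illustrated with a small Python sketch. It is not PolyPhred's implementation (how PolyPhred scans for repeats internally is not described here); the function name find_microsatellites and the test sequence are made up. The sketch assumes that a unit which is itself a repeat of a shorter unit (such as "AA") is counted only under the shorter unit.

```python
def find_microsatellites(seq, md):
    """Return (start, unit, repeats) for tandem repeats that satisfy md.

    md[k-1] is the minimum number of repeats for a unit of length k,
    mirroring the order of the integers given to -md.
    """
    hits = []
    for k, min_repeats in enumerate(md, start=1):
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            # Skip units built from a shorter repeating unit (assumption).
            if unit in (unit + unit)[1:-1]:
                i += 1
                continue
            repeats = 1
            while seq[i + repeats * k:i + (repeats + 1) * k] == unit:
                repeats += 1
            if repeats >= min_repeats:
                hits.append((i, unit, repeats))
                i += repeats * k  # continue after the repeat tract
            else:
                i += 1
    return hits

# Made-up sequence containing five copies of the dinucleotide AC.
sequence = "CGT" + "AC" * 5 + "GGT"
print(find_microsatellites(sequence, [8, 5, 4, 4, 4, 4]))  # prints [(3, 'AC', 5)]
print(find_microsatellites(sequence, [8] * 8))             # prints []
```

With the default definition (eight repeats for every unit length) the same tract is not treated as a microsatellite, so an indel following it would still be eligible for reporting.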
-ms [x / on / off]
Use this flag to switch on or off the marking of simple microsatellite repeats. If the argument 'x' is passed, putative SNP sites that are found within microsatellites are given a score equal to the score limit (see -score).

-  Default argument: on
-  If omitted: off

-nav [ FILENAME / on / off]
-n [ FILENAME / on / off]
Use this flag to generate a navigation file listing the polymorphic sites. If the file name is given but does not have a final ".nav" extension, PolyPhred adds one. The file is written to the edit_dir directory of the working directory.
-  Default argument: on, using the file name "polyphred.nav"
-  If omitted: off
To use the navigation file in Consed, click on 'Navigate', located at the top of the 'Consed Main Window'. Then click on 'Custom Navigation'. The window that appears should contain the name of the navigation file. Click on the file name to bring up the navigation window.
-output [ FILENAME / on / off ]
-o [ FILENAME / on / off ]
Use this flag to send the PolyPhred output either to a file or to the standard output (the screen). If the argument is "off", the output is written to the screen. In this case, the output can be redirected to a file using '>'.
-  Default argument: on, using the file name polyphred.out
-  If omitted: off
-quality THRESHOLD
-q THRESHOLD
Use this flag to set the quality threshold. PolyPhred uses the quality threshold to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed). Reducing this value results in less trimming of the ends. See Reducing the false-positive rate.
-  Accepted value: 0 - 50
-  If omitted: 25
-rank [ THRESHOLD / on / off ]
Use this flag to direct PolyPhred to score sites with the six-point ranking system. To set the rank threshold, follow the flag with a number from 1 to 6. PolyPhred marks and reports only sites that are assigned a rank between 1 and the rank threshold, inclusive. See Reducing the false-positive rate.
-  Accepted value: 1 - 6
-  Default argument: on, using the value 3
-  If omitted: the 100-point scoring system is used, as with -score
-ref [ REFID / on / off ]

Use this flag to specify a reference sequence for reporting of polymorphic site positions. PolyPhred will use the last sequence in the assembly whose name contains REFID. If PolyPhred finds such a sequence, it reports positions both relative to the consensus sequence and the reference sequence in the output report. Note that this flag does not use the reference sequence for comparing sites; use -refcomp for that. Also note that SNPs will not be reported at any position that the reference sequence does not cover, nor any position where the reference sequence is a pad.

See Using a reference sequence.

-  Default argument: on, using the identifier ".REF"
-  If omitted: off

-refcomp [ REFID / on / off ]
Use this flag to direct PolyPhred to use a reference sequence as the standard rather than the consensus sequence. PolyPhred will mark all sites that differ from the reference sequence, including homozygotes. Using this flag implies -ref.

See Using a reference sequence.

-  Default argument: on, using the identifier ".REF"
-  If omitted: off

-source /C
-s /C
-source POSN1 POSN2
-s POSN1 POSN2
-source off

Use this flag to activate the source genotype resolution function and set the location in the chromat file names of the source identifier, or turn the function off. The source identifier is a contiguous set of characters that uniquely identifies the source of the DNA sample. PolyPhred uses the source identifier to match sequences from the same DNA sample. See Reducing the false-positive rate.

The source identifier can be placed in the chromat file names in either of two methods. One method is to flank the identifier characters with a delimiter. Any valid file name character can serve as the delimiter. When running PolyPhred, indicate the delimiter as follows ('c' is the delimiter character):
polyphred -s /c

For example, if the chromat file names are of the form abc-SOURCEID-xyz.scf, then run PolyPhred as
polyphred -s /-
to use a dash as the delimiter character.

The second method for locating the source identifier is to place the identifier characters in a fixed location in all chromat file names. Indicate the location of the identifier characters as follows:
polyphred -s posn1 posn2

The positions are 1-based, meaning the first character in a filename is indicated by 1. For example, if all chromat file names are of the form abcSOURCExyz.scf where SOURCE is the location of the identifier characters, from positions 4 to 9, then run PolyPhred as follows:
polyphred -s 4 9

If the function has been activated in the .polyphredrc file, it can be turned off with the 'off' argument.
-  If omitted: off
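
The two ways of locating the source identifier can be sketched in Python as follows. This is an illustration of the naming conventions described above, not PolyPhred's code. The function names are hypothetical, and the delimiter version assumes the identifier is the text between the first two occurrences of the delimiter.

```python
from collections import defaultdict

def source_by_delimiter(name, delimiter):
    """Identifier flanked by a delimiter, as with: polyphred -s /-"""
    parts = name.split(delimiter)
    # Assumption: the identifier sits between the first two delimiters.
    return parts[1] if len(parts) >= 3 else None

def source_by_position(name, posn1, posn2):
    """Identifier at fixed 1-based, inclusive positions, as with: polyphred -s 4 9"""
    return name[posn1 - 1:posn2]

def group_by_source(names, extract):
    """Group chromat file names that share a source identifier."""
    groups = defaultdict(list)
    for name in names:
        groups[extract(name)].append(name)
    return dict(groups)

print(source_by_delimiter("abc-SOURCEID-xyz.scf", "-"))  # prints SOURCEID
print(source_by_position("abcSOURCExyz.scf", 4, 9))      # prints SOURCE

# Hypothetical forward and reverse reads from two individuals.
reads = ["fwd-NA001-a.scf", "rev-NA001-b.scf", "fwd-NA002-a.scf"]
print(group_by_source(reads, lambda n: source_by_delimiter(n, "-")))
```

Reads that land in the same group are the ones that would be treated as coming from the same DNA sample, which is why a consistent naming convention matters.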

-score [ NUMBER ]
Use this flag to select the 100-point scoring system and set the score threshold. PolyPhred marks and reports only sites that are assigned a score between 99 and the score threshold, inclusive. See Reducing the false-positive rate.
-  Accepted numbers: 0 - 99
-  If omitted or argument omitted: the 100-point scoring system is used with a score threshold of 70
-snp [ het / hom / on / off ]
Use this flag to switch SNP detection on or off, or to mark only heterozygous (het) or only homozygous (hom) polymorphisms.
-  Default argument: on, marking both heterozygous and homozygous polymorphisms
-  If omitted: on
-tag TAG-MODE
-t TAG-MODE

Use this flag to specify the tagging mode to use for viewing SNP sites in Consed. The three tagging modes are "genotype", "polymorphism", and "rank". The modes can be abbreviated as g, p and r, respectively.

genotype
g
In genotype mode, polymorphic sites are tagged on the consensus sequence with color indicating rank. Putative SNPs are marked in pink on the sample sequences.
polymorphism
p
In polymorphism mode, polymorphic sites are marked in blue on the consensus sequence, and SNPs are marked in pink on the sample sequences.
rank
r
In rank mode, color-coded tags indicating rank are placed on both the consensus and sample sequences.

See How PolyPhred scores SNP sites for the color codes.

-  If omitted: genotype

Using the genotype tag results in putative polymorphic sites marked on the consensus sequence with color-coded tags indicating rank, and putative SNPs marked with pink tags on the sample sequences. Using the rank tag results in color-coded tags indicating rank placed on both the consensus sequence and the sample sequences (see How PolyPhred scores SNP sites for the color codes.) Using the polymorphism tag results in a blue tag placed on all putative polymorphic sites on the consensus sequence and pink tags indicating putative SNPs on the sample sequences.

-update [on / off]
Use this flag to control updating of the ACE and PHD files. If updating is switched off, the ACE and PHD files are not updated, and the PolyPhred results cannot be viewed in Consed.

-  Default argument: on
-  If omitted: on

-verbosity [0 / 1 / 2]
-v [0 / 1 / 2]
Use this flag to set the level of status reporting that will be written to the screen while PolyPhred is running. The allowed arguments range from 0 (least reporting) to 2 (most reporting).

-  If omitted: 0

--version
-version
Use this flag to see the PolyPhred version and build number.

-  If omitted: normal operation

-window NUMBER
-w NUMBER
Use this flag to set the window width. PolyPhred uses the window width, together with the quality threshold, to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed).

-  Accepted numbers: 5 - 50
-  If omitted: 20

-xml [on / off]
Use this flag to specify the format of the PolyPhred output.

-  Default argument: on
-  If omitted: off (normal PolyPhred output)


How PolyPhred scores SNP sites

A SNP site generally appears in the sequence traces as two overlapping peaks with reduced peak heights. Ideally, the areas under these two peaks are nearly the same, and the heights of the peaks are reduced by about a half of what the height of a hypothetical homozygous peak would be at the same position.

When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a score that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. The score values range from 99 to 0, with 99 indicating a very good fit.

If a site is determined to be homozygous, PolyPhred compares its genotype with that of a standard sequence, which can be either the consensus sequence or a user-specified reference sequence. If the genotypes do not match, the site is marked as a minor or alternative allele.

If the -source flag is used, PolyPhred combines the information in matched reads to increase the accuracy of its genotype calls. Scores for genotypes that are in agreement are increased (see Reducing the false-positive rate).

When all sites at a position (i.e., a column as viewed in Consed) have been assigned a score, PolyPhred calculates an overall score and genotype for the position. This score depends on the highest-scoring site in the sample sequences. If the overall score is greater than or equal to the score threshold (see the -score flag), then PolyPhred marks the position as polymorphic. The number of sites that PolyPhred marks can be controlled by adjusting the score threshold (see Reducing the false-positive rate).

If the six-point ranking system is selected, PolyPhred converts the score to a rank according to the table below. Along with each rank is the color of the tags as displayed in Consed.

The 'True Positive Rate' column shows the percentage of true positive SNPs marked within each rank, as found in our own analysis, using the default -score and -quality settings. These results may vary depending on changes in these settings, as well as on the quality of the data and the number of samples analyzed.

Accuracy by score and rank
Score    Rank   Tag Color    True Positive Rate
99       1      red          97%
95–98    2      orange       75%
90–94    3      green        62%
70–89    4      dark blue    35%
50–69    5      magenta      11%
0–49     6      purple       1%
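
The score-to-rank conversion in the table can be written out as a small Python lookup. The thresholds and colors come directly from the table; the function name score_to_rank is hypothetical, and the code is a sketch rather than PolyPhred's implementation.

```python
# (minimum score, rank, tag color), taken from the table above.
RANK_TABLE = [
    (99, 1, "red"),
    (95, 2, "orange"),
    (90, 3, "green"),
    (70, 4, "dark blue"),
    (50, 5, "magenta"),
    (0, 6, "purple"),
]

def score_to_rank(score):
    """Map a score from 0 to 99 to its (rank, tag color)."""
    if not 0 <= score <= 99:
        raise ValueError("score must be between 0 and 99")
    for min_score, rank, color in RANK_TABLE:
        if score >= min_score:
            return rank, color

print(score_to_rank(99))  # prints (1, 'red')
print(score_to_rank(70))  # prints (4, 'dark blue')
```

For example, the default score threshold of 70 corresponds to the lower edge of rank 4.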

The output report

To facilitate parsing of the output file, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.

The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.

Next is the COMMAND_LINE block. Listed in this block are the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. These include the working directory, the ACE file that was used, and those parameters that affect the analysis.

The rest of the report contains results for one or more contigs. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks appear in the output report by using the -block flag.
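
Because every block is bracketed by BEGIN_ and END_ tokens, a report can be split into blocks with a simple stack. The Python sketch below is one possible approach, not a parser supplied with PolyPhred. It assumes each token stands alone on its own line, and the sample report is a made-up skeleton whose data lines are placeholders rather than real output.

```python
def parse_blocks(lines):
    """Yield (path, content) for each block as it closes.

    path is a tuple of enclosing block names, outermost first; content
    holds the non-empty lines directly inside the block (lines of nested
    blocks are reported with those nested blocks).
    """
    stack = []    # names of the currently open blocks
    content = []  # one list of lines per open block
    for raw in lines:
        line = raw.strip()
        if line.startswith("BEGIN_"):
            stack.append(line[len("BEGIN_"):])
            content.append([])
        elif line.startswith("END_"):
            name = line[len("END_"):]
            if not stack or stack[-1] != name:
                raise ValueError("unexpected END_" + name)
            yield tuple(stack), content.pop()
            stack.pop()
        elif stack and line:
            content[-1].append(line)

# Made-up skeleton of a report; the data line is a placeholder.
report = """BEGIN_MESSAGE
BEGIN_CONTIG
Contig1
BEGIN_POLY
placeholder site line
END_POLY
END_CONTIG
END_MESSAGE""".splitlines()

for path, block_lines in parse_blocks(report):
    print("/".join(path), block_lines)
```

The contig name appears as the first line of the CONTIG block's own content, matching the layout described above.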

If the -ref flag is used, PolyPhred adds an additional field in the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. The extra field, which comes second after the consensus sequence position, is the position relative to a reference sequence.

The POLY block

In this block, the putative SNP sites identified by PolyPhred are listed, as well as sites marked by columntag type tags (see User-defined manual tags). Each line reports the consensus sequence position, the 5' sequence flanking the polymorphic site, the two most common alleles at the site, the 3' sequence flanking the site, and the overall score assigned to the site.
-  XML tag: block-snp_site    subtag: snp_site

The GENOTYPE block

In this block, the genotypes of the individual sample sequences are listed for each putative SNP site listed in the POLY block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position (in alphabetical order), and the score.

If the -ref or -refcomp flags were supplied, the reference sequence position appears after the consensus position.

If the -extended_genotype flag was passed to PolyPhred, two additional columns are printed indicating the direction of the read and the coordinate of the primary peak as determined by Phred. See the flags section for more information.

-  XML tag: block-snp_genotype    subtag: snp_genotype

The COLUMNGENOTYPE block

In this block, the genotypes of the individual sample sequences are listed for each manual-SNP tag applied to the consensus sequence. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
-  XML tag: block-manual_snp    subtag: snp_genotype

The COLUMNINDEL block

In this block, the genotypes of the individual sample sequences are listed for each manual-indel tag. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, and the genotype. The tag used to specify the genotype can be user-defined in the .polyphredrc file (see User-defined manual tags).
-  XML tag: block-manual_indel    subtag: manual_indel

The MANUALGENOTYPE block

In this block, sample sequence sites that have been tagged manually are listed. Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, the identity of the tag, and the comment if one is present.
PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
-  XML tag: block-manual_genotype    subtag: manual_genotype

The VERIFIED block

In this block, sites manually tagged as verified are listed. Each line reports the consensus sequence position and the tag identity. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
-  XML tag: block-verified_site    subtag: verified_site

The MICROSATELLITE block

If the -ms flag is set to 'on', this block lists the microsatellite sequences that were found. Each line reports the consensus sequence position of the 5' end of the microsatellite and the repeat pattern.
-  XML tag: block-microsatellite    subtag: microsatellite

The SAMPLE block

The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the positions of the left and right boundaries of the search region (between the trimmed ends), and the average site quality, as determined by Phred, within the search region.
-  XML tag: block-sample_quality    subtag: sample_quality

The COVERAGE block

This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
-  XML tag: block-coverage    subtag: coverage

The INDELPOLY and INDELGENOTYPE blocks are new to version 6

Running PolyPhred with the -i flag adds two blocks to the output report: the INDELPOLY block, which contains information about indel sites, and the INDELGENOTYPE block, which contains information about the genotypes at these sites.

The INDELPOLY block

This block reports information about putative indel sites. The columns are as follows:

  1. the consensus position of the indel site
  2. the smallest value among the consensus positions of indels found in the heterozygotes at the site (the indel positions determined by the method may differ among the heterozygous reads at a given indel site)
  3. the largest value among the consensus positions of indels found in the heterozygotes at the site
  4. the length of the indel
  5. the score assigned to the site

The INDELGENOTYPE block

This block reports genotype calls for sites listed in the INDELPOLY block. The columns are as follows:

  1. the consensus position of the indel site
  2. the consensus position of the indel found in the read (if the genotype is not heterozygous, this value is the same as column 1)
  3. the length of the indel found in the read (for homozygotes, this value is 0)
  4. the name of the read
  5. the genotype score of the read
  6. the genotype:
    • ++ if the read is homozygous for the long allele
    • +- if the read is heterozygous
    • -- if the read is homozygous for the short allele

Options that affect columns displayed in each block

If the -idat flag is used, the above two blocks report additional information:

The INDELPOLY block

Two more columns are added to the original 5 columns. The 6th column reports the log-likelihood-ratio score for the site and the 7th column reports the location of a microsatellite found upstream of the site (-1 if no microsatellite found).

The INDELGENOTYPE block

Two columns are inserted between the 5th and the 6th columns: the first reports the log-likelihood-ratio score for the genotype, and the second reports the location of a microsatellite found upstream of the site (-1 if no microsatellite is found).
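
As an illustration of the column layouts described above, the following Python sketch splits one row of either block into named fields. It assumes that the columns are separated by whitespace and that PolyPhred was run without -ref or -refcomp (which add a second position). The field names and the sample rows are invented for this example and do not come from PolyPhred.

```python
# Hypothetical field names; PolyPhred itself does not name the columns.
INDELPOLY_COLUMNS = ["site_pos", "min_het_pos", "max_het_pos",
                     "indel_length", "site_score"]
INDELGENOTYPE_COLUMNS = ["site_pos", "read_pos", "indel_length",
                         "read_name", "genotype_score", "genotype"]
IDAT_COLUMNS = ["llr_score", "microsatellite_pos"]  # extra columns with -idat


def column_names(block, idat=False):
    """Return the expected column names for one row of the given block."""
    extra = IDAT_COLUMNS if idat else []
    if block == "INDELPOLY":
        # -idat appends the two extra columns as columns 6 and 7.
        return INDELPOLY_COLUMNS + extra
    if block == "INDELGENOTYPE":
        # -idat inserts the two extra columns between columns 5 and 6.
        return INDELGENOTYPE_COLUMNS[:5] + extra + INDELGENOTYPE_COLUMNS[5:]
    raise ValueError("unknown block: " + block)


def parse_row(line, block, idat=False):
    """Split a whitespace-delimited row into a dict of strings."""
    names = column_names(block, idat)
    values = line.split()
    if len(values) != len(names):
        raise ValueError("expected %d columns, found %d"
                         % (len(names), len(values)))
    return dict(zip(names, values))


# Made-up rows, for illustration only.
site = parse_row("1204 1202 1205 3 98", "INDELPOLY")
call = parse_row("1204 1202 3 sample01.f 95 12.7 -1 +-",
                 "INDELGENOTYPE", idat=True)
print(site["indel_length"], call["genotype"])
```

Values are kept as strings so that the sketch makes no assumption about how PolyPhred formats scores.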


User-defined manual tags

One of the features available in the Consed program is the ability to create custom tags. These tags can be used to mark or highlight specific sites or regions on the consensus sequence or on individual sample sequences. For example, following analysis by PolyPhred, the user can manually mark putative SNP sites as verified, or change an incorrect genotype. To create custom tags, the user needs to define the tags in the .consedrc file (see the Consed documentation under the Help menu).

PolyPhred can be set to recognize four types of custom tags, and take an appropriate action when they are encountered. This provides a way for the user to pass information from Consed to the PolyPhred output file. For example, PolyPhred can be set to recognize a custom "verified" tag and report sites marked with this tag type in the VERIFIED block of the output file. In addition, two of the custom tag types, columntag and columnindeltag, can be used to force PolyPhred to report genotypes for all sample sequences at the specified positions.

For PolyPhred to recognize the tags, they must be listed in the .polyphredrc file (see Customizing PolyPhred). Once the .polyphredrc file has been set up, the typical procedure is to 1) assemble the data, 2) run PolyPhred, 3) use Consed to analyze the results, mark sites and make changes, and 4) run PolyPhred again to obtain both the PolyPhred- and user-generated information in the output file.

The tag types are as follows:

manualtag

Tags of this type are used to mark or edit a site in a sample sequence. Typically these tags are used to change the genotype call made by Phred or PolyPhred. Sites marked with these tags are listed in the MANUALGENOTYPE block.

verifiedtag

This tag type is applied to the consensus sequence to indicate that a polymorphic site is verified. Sites marked with these tags are listed in the VERIFIED block.

columntag

Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide SNP genotypes for all of the sample sequences at the tagged sites. Sites marked by these tags are listed in the POLY block, and the genotypes in the sample sequences are listed in the COLUMNGENOTYPE block.

columnindeltag

Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide indel genotypes for all of the sample sequences at the tagged sites. The tags can be used to mark the positions and define the length of indel sites. The tag should "cover" the segment involved in the indel so that PolyPhred can report the indel segment in the output. Sites marked by these tags are listed in the POLYINDEL block, and the genotypes in the sample sequences are listed in the COLUMNINDEL block. The name of the tag that marks the site will be used to indicate the homozygous genotype. The heterozygous genotype can be set in the .polyphredrc file with the 'indelhettag' key-word. If this is not set, PolyPhred will indicate heterozygotes with the label 'heterozygoteIndel'.

indelSite

This tag is added to the consensus sequence. Additional information included in the tag is: the consensus location of the indel, the score of the site and the length of the indel.

heterozygoteIndel

This tag is added to the heterozygous genotypes at the site. Additional information included with this tag is: the consensus position, genotype, genotype score and the length of the indel in the read.

homozygoteIndel

This tag is added to the homozygous genotypes at the site.


Installing PolyPhred

  1. To run PolyPhred, you will need the GNU standard C++ library libstdc++.so.6 or later, which is included with gcc 3.4.3 or later. gcc itself is not required to run PolyPhred.
  2. Make sure the following programs are installed:
    phred              version 0.961028 or later
    phrap              version 0.960731 or later
    phd2fasta          version 0.971024 or later
    consed             version 13.0 or later
    
  3. Download the PolyPhred package for the appropriate platform. Put the file in a directory where it is to be unpacked.
  4. Run the command
    tar xvf polyphred.tar.gz
    replacing polyphred.tar.gz with the exact name of the file you downloaded. This should produce the following files and directories:
    polyphred-VERSION-binary-HOST/
      bin/
        polyphred          the PolyPhred program
        polygen            tool for making PHD and POLY files from ABI chromat files.
        sudophred          tool for making chromat, PHD and POLY files from FASTA files
        phredPhrap         perl script for running phred and phrap together in the correct order.
      doc/
        polyphred.html     this document
    
  5. Move or copy the polyphred, polygen, sudophred and phredPhrap files to a directory in your $PATH, such as /usr/local/bin:
    cd polyphred-VERSION-binary-HOST/bin
    cp -vi polyphred polygen sudophred phredPhrap /usr/local/bin
  6. If you already have a copy of phredPhrap and wish to keep it, you must open the phredPhrap file and edit it as follows:
    1. Uncomment (remove the # from) the line
      # $polyPhredExe = "/usr/local/genome/bin/polyphred";
      Make sure the path within the quotes matches the directory in the previous step.
    2. Change the 0 to 1 in the line
      $bUsingPolyPhred = 0;
    3. phredPhrap also contains instructions for running PolyPhred automatically after Phred and Phrap. It is recommended that these lines be inactivated and PolyPhred be run separately. This makes it easier to determine the source when problems occur. To inactivate the lines, remove or comment out the following:
      if ( $bUsingPolyPhred ) {

      print "\n\n--------------------------------------------------------\n";

      print "Now running polyphred for polymorphism detection...\n";
      print "--------------------------------------------------------\n\n\n";


      $szPolyPhredFile = $szBaseName . ".polyphred.out";
      $szPolyPhredFile = $szBaseName . ".fasta.screen.polyphred.out";

      !system( "$polyPhredExe -ace $szAceFileToBeProduced > $szPolyPhredFile" ) ||
      die "some problem running $polyPhredExe $!";

      }
      

See the section Customizing PolyPhred for instructions on customizing Consed.


Running PolyPhred

PolyPhred reads and modifies data files that are generated by the programs Phred and Phrap, and that can be examined with the program Consed. These programs require the sequence data files to be located in a 'work directory' containing three subdirectories called 'chromat_dir', 'phd_dir' and 'edit_dir'. In addition, PolyPhred needs a fourth subdirectory called 'poly_dir'. It is recommended that a separate work directory be created for each data set. For example, if the data set is called "mydata", a directory called mydata can be created:

mkdir mydata

Within this directory, create the four subdirectories as follows:

cd mydata
mkdir chromat_dir edit_dir phd_dir poly_dir

After these directories have been created, move or copy the chromat files to the chromat_dir directory.
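
For users who script their setup, the directory layout above can also be created and checked from Python. This is only a sketch of the same steps as the mkdir commands; the function names are invented for illustration and are not part of PolyPhred.

```python
from pathlib import Path

# The three subdirectories used by Phred, Phrap and Consed, plus poly_dir.
SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")


def make_work_dir(parent, name):
    """Create parent/name with the four subdirectories and return its path."""
    work_dir = Path(parent) / name
    for sub in SUBDIRS:
        # parents=True also creates work_dir; exist_ok makes reruns harmless.
        (work_dir / sub).mkdir(parents=True, exist_ok=True)
    return work_dir


def missing_subdirs(work_dir):
    """List the required subdirectories that are absent from work_dir."""
    return [sub for sub in SUBDIRS if not (Path(work_dir) / sub).is_dir()]
```

Calling make_work_dir(".", "mydata") gives the same result as the commands shown above, and missing_subdirs("mydata") returns an empty list when the layout is complete.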

If a reference sequence is to be included in the assembly, use the sudophred tool to generate fake chromat, PHD and POLY files.

To assemble the data, cd to the edit_dir directory and run

phredPhrap mydata

The phredPhrap script automatically runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir, including one with the extension .ace.1 (the ACE file), and several files in the directories phd_dir and poly_dir.

View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.

Now run PolyPhred. Include any desired flags on the command line. For example:

polyphred -o polyphred.out -s /_ -score 10 -indel 8

The output report can be viewed in a pager or text editor. Use Consed to view or edit the tags PolyPhred has placed on the assembly (see Customizing PolyPhred for more information):

consed

Reducing the false-positive rate

There are three ways to affect the rate of false-positive calls made by PolyPhred. The best method is to use the source genotype resolution function (the -source flag). This method achieves a large reduction in false positives while minimizing the loss of true sites. To use this feature, there should be double-stranded coverage (sequencing in both directions) for most or all of the samples. Each sequence file (chromat) name should contain a string of contiguous characters that uniquely identifies the sample. The identifier is then passed to PolyPhred using the -source flag, which allows PolyPhred to match reads from the same source. When the genotype calls for two matched reads agree, the resulting score is increased. If the genotypes disagree, PolyPhred chooses the genotype with the greater likelihood of being correct.

The most direct method is to use the -score flag to set the score threshold. Only sites that receive a score above this threshold are called, so increasing the threshold results in fewer calls.

For those using the six-point ranking system, increasing the rank threshold means setting this value to 2 or 1. This has the same effect as increasing the score threshold to 95 or 99, respectively.

In general, the false-positive SNP call rate tends to increase near the trimmed regions at the ends of a sequence. Therefore, trimming more of the ends will tend to reduce the number of false-positive calls. The length of the trimming is increased by raising the quality threshold, which is set with the -quality flag.

For all of these methods, reducing the number of false-positive calls will also result in an increase in the number of real SNPs that are missed (false negatives). Generally, as one reduces the false-positive rate, the number of false positives that are eliminated is much greater than the number of missed real SNPs. Also, the first real sites that are missed are the rare SNPs, that is, sites with only one or two heterozygotes present in the data set.


Using the polygen tool

The polygen program can be used to create PHD and POLY files using the base calls and quality scores generated by the ABI base-calling software. This method is an alternative to using the Phred base-calling program.

Polygen can be run from either the edit_dir directory or the directory above it (the work directory). To run the program, enter:

polygen

It can also be run from any other directory by using the -dir (or -d) flag to specify the work directory where the data is located, similar to the -dir flag for PolyPhred. For example:

polygen -d ~/my_home_dir/gene_data

The program looks in the chromat_dir directory for the chromat files. It creates a PHD and POLY file for each chromat file that lacks a PHD file. The PHD files are written into the phd_dir directory, and the POLY files are written into the poly_dir directory.

Alternatively, the -list (-l) flag can be used to specify a file containing a list of chromat files. Polygen creates PHD and POLY files from these chromat files instead. Each line in the file should be the name of a file in the chromat_dir, or a path relative to the chromat_dir. If the filename is a single dash, the list is read from standard input instead.

To force polygen to overwrite any existing PHD and POLY files, use the -overwrite (-o) flag.

Run polygen -h or polygen --help to show a list of the options.

Run polygen -v or polygen --version to show the version.


Using the sudophred tool

The sudophred program is a tool that can be used to generate fake chromat, PHD and POLY files from DNA sequences in FASTA format. Fake chromat and PHD files are needed if one wishes to include a reference sequence in the assembly of the data set (see Using a reference sequence). Also, if one wants to compare data from sequence trace (chromat) files with text sequences, the text sequences need to be converted into all three file types.

The sudophred program takes one text file as input. The text file can contain one or more sequences in FASTA format. If one is generating fake data files for a reference sequence, sudophred writes data files for the first sequence only. Otherwise, sudophred will generate data files for each of the sequences in the text file. In either case, the names of the data files are taken from the string that follows the greater-than symbol (>) at the beginning of each sequence.

One way to run sudophred is to put the FASTA file into the edit_dir directory. Sudophred will create each file that it generates in the appropriate directory. That is, the chromat file will be created in the chromat_dir directory, the phd file in phd_dir, and the poly file in poly_dir. One can also put the FASTA file in an arbitrary directory and run sudophred from there. In this case, or if the normal directories cannot be found, sudophred will write all of the files into that same directory. The files must then be moved to the appropriate data subdirectories. In either case, it is easiest to generate the fake data files before running the phredPhrap program that assembles the data into contigs.

By default, sudophred writes all three files. The chromat files are written in SCF format. In the phd files, all quality values are set to 59.

To run sudophred, enter:

sudophred filename

where filename is the name of the text file containing the sequences. The file name must always be the first argument.

To use sudophred to generate files for a reference sequence, use the -r flag. This flag can be followed by a string that PolyPhred will use to identify the reference sequence. For example,

sudophred filename -r .XYZ

will instruct sudophred to create a sequence whose name ends with ‘.XYZ’. If no string is supplied, sudophred will use the default string ‘.REF’.
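
The naming rules described in this section can be sketched in Python as follows. This is not sudophred's code: it assumes that the whole header line after the '>' is used as the name (the real program may stop at the first space), and it predicts only the sequence names, not the file extensions. The function names are invented for illustration.

```python
def fasta_names(fasta_text):
    """Return the text that follows '>' on each FASTA header line."""
    names = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            names.append(line[1:].strip())
    return names


def expected_sequence_names(fasta_text, ref_id=None):
    """Predict the sequence names derived from a FASTA file.

    With ref_id set (as with the -r flag), only the first sequence is
    used and the identifier is appended to its name.
    """
    names = fasta_names(fasta_text)
    if ref_id is None:
        return names
    return [names[0] + ref_id] if names else []


# Made-up input, for illustration only.
example = ">geneA\nACGTACGT\n>geneB\nTTGACCAA\n"
print(expected_sequence_names(example))
print(expected_sequence_names(example, ref_id=".REF"))
```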

To change the quality threshold, use the -q flag followed by the value (an integer from 0 to 59). For example:

sudophred filename -q 20

To write the chromat files in ABI format, use the -abi flag:

sudophred filename -abi

Run "sudophred -h" or "sudophred -help" to show a list of the options.

Run "sudophred -v" or "sudophred -version" to show the version.


Using a reference sequence

For the purpose of locating SNPs and other features on a standard sequence map, it is useful to include the standard, or reference, sequence in the data assembly. One can then run PolyPhred with the -ref flag to obtain the SNP positions relative to that reference sequence. Furthermore, one might want to have PolyPhred compare the sample sequences with the reference sequence rather than with the consensus sequence that is generated by Phrap. This can be done by running PolyPhred with the -refcomp flag.

When either the -ref or -refcomp flag is used, PolyPhred reports two positions rather than one in the output file. The blocks displaying this alternate format are the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. In each block, the first number is the position of the feature relative to the consensus sequence, and the second is the position relative to the reference sequence.

To include a reference sequence in the assembly, one should first create the necessary data files from the reference sequence. These files can be generated with the sudophred program supplied with PolyPhred (see Using the sudophred tool).

Use sudophred with the -r flag to generate the reference sequence data files. For example,

sudophred filename -r

where filename is the name of the text file containing the reference sequence in FASTA format. The data files will be given names that begin with the string that follows the '>' at the beginning of the sequence, followed by the default reference identifier ".REF". In this case, one would run PolyPhred with the reference options as follows:

polyphred -ref

To specify a different reference identifier, follow the -r flag with the identifier string. For example, to set the reference identifier as ‘xYZ’, run:

sudophred filename -r xYZ

In this case, the data files will contain the string "xYZ" in the file names, rather than ".REF", and it will be necessary to select the reference option as follows:

polyphred -ref xYZ

Customizing PolyPhred

PolyPhred can be customized to suit the preferences of the user by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as to specify any manual tags that PolyPhred should capture and write to the output report. This file is optional. If it is not present, PolyPhred will use its built-in default parameter values and will not capture manual tags.

When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks in the directory named by the POLYPHRED_PATH environment variable. This variable can be set by including the following line in the user's shell rc file:

setenv POLYPHRED_PATH [path]

where [path] is the directory containing the .polyphredrc file.
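
The search order can be sketched in Python as shown below. This only illustrates the order described above and is not PolyPhred's own code; the function name find_rc_file is invented for the example.

```python
import os
from pathlib import Path


def find_rc_file(cwd, home, environ, name=".polyphredrc"):
    """Return the first rc file found in the documented order, or None."""
    # 1. current directory, 2. home directory
    candidates = [Path(cwd), Path(home)]
    # 3. the directory named by POLYPHRED_PATH, if that variable is set
    extra = environ.get("POLYPHRED_PATH")
    if extra:
        candidates.append(Path(extra))
    for directory in candidates:
        rc_file = directory / name
        if rc_file.is_file():
            return rc_file
    return None


if __name__ == "__main__":
    print(find_rc_file(Path.cwd(), Path.home(), os.environ))
```

Because the first match wins, a .polyphredrc file in the current directory hides one in the home directory, which makes per-project settings possible.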

Each line in the .polyphredrc file can be a blank line, a comment line beginning with a '#' character, or a line beginning with one of the following key-words:

flag

The 'flag' key-word can be used with any of the command-line flags to change a default value. For example, to change the default score threshold to 80 and the quality threshold to 30, enter these lines in the .polyphredrc file:

flag -score 80
flag -q 30

outputfile

The following line

flag -output out.txt

changes two defaults; it will set the name of the output file to 'out.txt' and cause PolyPhred to write the output in a file with that name rather than to the screen. To change the default file name but keep output to the screen as the default activity, use the 'outputfile' key-word, as:

outputfile out.txt

Then, to use the new default output file name, run ‘polyphred -o on’.

navfile

Similarly, both lines below change the default name of the navigation file, but the first line causes PolyPhred to write a navigation file by default, while the second line leaves the default activity off.

flag -nav [file name]
navfile [file name]

refID

All three lines below change the default reference sequence identifier. The first two lines turn on ref and refcomp modes, respectively, while the third line does not affect the reference mode. Note that sequences containing the reference identifier are excluded from regular processing, even when ref and refcomp modes are disabled.

flag -ref identifier
flag -refcomp identifier
refID identifier

ranks

The 'ranks' key-word allows the user to change the values used to convert scores to ranks (see How PolyPhred scores SNP sites). For example, the following line:

ranks 90 80 60 40 20

will result in these conversions:

Rank to score conversion example
Probability    Rank
99-90          1
89-80          2
79-60          3
59-40          4
39-20          5
19-0           6
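
The conversion in this example can be written as a short Python function. It is a sketch of the table above, not PolyPhred's own code; the function name is invented, and the default thresholds are the ones from the example 'ranks' line, not PolyPhred's built-in defaults.

```python
def score_to_rank(score, thresholds=(90, 80, 60, 40, 20)):
    """Convert a score (0-99) to a rank from 1 (best) to 6.

    thresholds holds the lowest score for ranks 1 to 5, in the order
    given on the 'ranks' line; any lower score gets rank 6.
    """
    for rank, lower_bound in enumerate(thresholds, start=1):
        if score >= lower_bound:
            return rank
    return len(thresholds) + 1


for score in (99, 90, 89, 60, 39, 19):
    print(score, score_to_rank(score))
```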

acedir, phddir, polydir

The 'acedir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'acedir' key-word sets the location of the ace file (which is normally in the edit_dir directory). The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. A directory is considered to be within the work directory, unless an absolute path starting with '/' is given. Use a '.' to indicate that a directory is the same as the work directory.
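
The path rule for these key-words can be illustrated with a few lines of Python. This is a sketch of the rule as stated above, using a made-up work directory; the function name resolve_data_dir is invented for the example.

```python
from pathlib import PurePosixPath


def resolve_data_dir(work_dir, value):
    """Resolve an acedir/phddir/polydir value against the work directory."""
    path = PurePosixPath(value)
    if path.is_absolute():
        # An absolute path (starting with '/') is used as given.
        return path
    # Anything else is taken to be inside the work directory;
    # '.' collapses to the work directory itself.
    return PurePosixPath(work_dir) / path


work = "/data/mydata"  # hypothetical work directory
print(resolve_data_dir(work, "."))
print(resolve_data_dir(work, "phd_alt"))
print(resolve_data_dir(work, "/scratch/phd"))
```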

date

The 'date' key-word allows the user to set the format of the date that appears at the top of the output file. The key-word must be followed by one of six format codes:

Date format codes
Format code    Example
DMY            31/12/07
MDY            12/31/07
YMD            07/12/31
DMYY           31/12/2007
MDYY           12/31/2007
YYMD           2007/12/31

The default is the DMY format.
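
For scripts that need to read or reproduce the report date, the six codes correspond to the following Python strftime patterns. The mapping is inferred from the examples in the table; whether PolyPhred pads single-digit days and months with a zero is an assumption here.

```python
from datetime import date

# Format codes from the table, mapped to strftime patterns.
DATE_FORMATS = {
    "DMY": "%d/%m/%y",
    "MDY": "%m/%d/%y",
    "YMD": "%y/%m/%d",
    "DMYY": "%d/%m/%Y",
    "MDYY": "%m/%d/%Y",
    "YYMD": "%Y/%m/%d",
}


def format_report_date(day, code="DMY"):
    """Format a date using one of the six format codes."""
    return day.strftime(DATE_FORMATS[code])


print(format_report_date(date(2007, 12, 31), "YYMD"))
```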

verifiedtag, columntag, columnindeltag, indelhettag, manualtag

Four of the key-words set tag names for the four tag types (see User-defined manual tags). Each tag type can have more than one name (see the example .polyphredrc file below). In addition, the indelhettag key-word allows the user to specify the tag that will be used to indicate heterozygous indels.

Here is an example of a .polyphredrc file:

# PolyPhred configuration file

# Set the date format to YYYY/MM/DD
date YYMD

flag -q 25      # Quality threshold
flag -f 16      # Flanking length

# Send output to edit_dir/report.txt
outputfile report.txt

# Treat files containing `.refSeq' as reference sequences
refID .refSeq

# Manual tags to read from ACE file and include in output report
verifiedtag    polymorphism
columntag      manualGenotype
columnindeltag indel:++
columnindeltag indel:--
indelhettag    indel:+-
manualtag      heterozygote
manualtag      homozygote
manualtag      indel
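
To check a configuration file before running PolyPhred, a file like the one above can be read with a few lines of Python. This sketch is not PolyPhred's parser. It assumes that a '#' anywhere on a line starts a comment (as the inline comments in the example suggest) and that the key-word is separated from its value by whitespace. The function name parse_rc is invented for the example.

```python
def parse_rc(text):
    """Return a list of (keyword, value) pairs from .polyphredrc-style text."""
    entries = []
    for raw_line in text.splitlines():
        # Assumption: everything from '#' to the end of the line is a comment.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)
        keyword = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        entries.append((keyword, value))
    return entries


example = """\
# PolyPhred configuration file
date YYMD

flag -q 25      # Quality threshold
refID .refSeq
columnindeltag indel:++
columnindeltag indel:--
manualtag      heterozygote
"""

for keyword, value in parse_rc(example):
    print(keyword, value)
```

Returning a list rather than a dictionary keeps repeated key-words, such as the two columnindeltag lines, in the order they were written.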

Customizing PolyGen

Polygen can be customized in a manner similar to PolyPhred (see Customizing PolyPhred). In this case, polygen reads settings from a file named .polygenrc. This file can be stored in the user's current directory, in the user's home directory, or in the directory specified by the POLYPHRED_PATH environment variable.

As with PolyPhred, the flag key-word can be used to set any of polygen's flags.

The 'chromatdir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'chromatdir' key-word sets the location of the chromat files. The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. Each directory path given is assumed to be relative to the work directory, unless an absolute path (starting with a '/') is given. Use a '.' to indicate that a directory is the same as the work directory.


Whom to contact with questions and problems

If you have questions or problems with Phred, Phrap or Consed, or you need to obtain these programs, please see the web site at:
http://www.phrap.org

If you have questions, comments, or bug reports regarding PolyPhred, please:

  1. Read this documentation carefully. You can find the most recent version of this document at http://droog.gs.washington.edu/PolyPhred.html
  2. If your issue remains unresolved, please email polyphred at u dot washington dot edu. Be as specific as possible. You should indicate which version of PolyPhred you are using, what platform you are running on, the steps to reproduce the problem, and what behavior you expected.
  3. Please do not email questions to the webmaster.


References

      1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankenar, W., and Nickerson, D.A., 1994
         "Comparative analysis of human DNA variations by fluorescence-based sequencing 
         of PCR products", Genomics 25, 615-622.

      2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L, 1997, "PolyPhred: automating the 
         detection and genotyping of single nucleotide substitutions using fluorescence-based 
         resequencing", Nucleic Acids Research, 25: 2745-2751.

      3. Ewing, B., Hillier, L., Wendl, M.,  and Green, P., 1998, "Basecalling of automated 
         sequencer traces using phred.  I. Accuracy assessment", Genome Research 8: 175-185.

      4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using 
         phred.  II. Error probabilities", Genome Research 8: 186-194.  

      5. Green, P., 1994, Phrap, unpublished.  http://www.phrap.org

      6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence 
         finishing", Genome Research 8:195-202.

      7. Stephens M, Sloan JS, Robertson PD, Scheet P, Nickerson DA., 2006, "Automating 
         sequence-based detection and genotyping of SNPs from diploid samples," 
         Nat Genet. 2006 Mar;38(3):375-81. Epub 2006 Feb 19.

      8. Bhangale T., Stephens M., Nickerson DA., 2006, "Automating resequencing-based detection 
         of insertion-deletion polymorphisms" (submitted).