Documentation for PolyPhred Version 5.0

Last modified: 2006/02/03

Program: PolyPhred
Version: 5.0
Copyright (C) 2005-2007
by Deborah A. Nickerson, Scott Taylor, Natali Kolker, Jim Sloan and Matthew Stephens
University of Washington

All rights reserved.

This software is part of a test version of the PolyPhred distribution package. It may not be redistributed, distributed in modified form, or used for any commercial purpose, including commercially funded sequencing, without written permission from the authors and the University of Washington.
This software is provided "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In particular, this disclaimer applies to any diagnostic purpose. In no event shall the authors or the University of Washington be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.

Description of Features
Setup and Operating Instructions
More Information
- Who to contact with questions and problems
- References

Introduction

Single nucleotide polymorphisms (SNPs) are the most frequent form of DNA sequence variation in the human genome. The identification and typing of these variations play a central role in analyzing the relationships between genome structure and function, and in understanding the allelic variation within and among populations.

Many techniques are used to identify sequence variants among different individuals using DNA amplified by the polymerase chain reaction (PCR). These include denaturing gel electrophoresis, chemical or enzymatic cleavage, heteroduplex analysis, the analysis of single-stranded DNA conformations, variant detector arrays, and direct sequencing of a PCR product. PolyPhred is a program that helps to accurately identify heterozygous sites in sequences produced by sequencing PCR products with fluorescence-based chemistries such as dye-labeled terminators or dye-labeled primers. The program compares sequence traces and searches for homozygotes and heterozygotes.

Identification of potential heterozygous sites is based on 1) the presence of two significant overlapping fluorescence peaks at such sites in the sequence trace, and 2) detecting a decrease of about 50% in the peak heights when the sequence trace is compared with that obtained from homozygous individuals (references 1 and 2). PolyPhred scans for these two features when sequence traces are being compared to detect heterozygotes among homozygotes (reference 2). In addition, if double-stranded coverage of the reads is provided, the accuracy of the results is significantly increased.

PolyPhred is not a stand-alone program. It is designed as a member of an integrated suite of sequence analysis applications that includes the programs Phred (references 3,4), Phrap (reference 5), and Consed (reference 6).

How PolyPhred works

PolyPhred identifies potential heterozygous sites by comparing traces in a sequence assembly. Phred provides the base-calls, base quality information and the peak size information, which is stored in two types of files called PHD and POLY files. Phrap is used to assemble the input sequences into one or more contigs, and to derive a consensus sequence for each contig. The assembly information is stored in a file called the ACE file. PolyPhred uses all three file types to analyze the sequence traces. It first reads the ACE file to obtain the consensus sequence and the names of the trace (chromat) files used in the assembly. It then reads the PHD and POLY files associated with each trace.

During the SNP search phase, PolyPhred combines information from all of the sequence traces to derive a genotype and a score for each sequence (see How PolyPhred scores SNP sites). It also uses a standard sequence for comparison to identify sites that are homozygous for a minor or alternative allele. The score indicates how well the trace at the site matches the expected pattern for a SNP. After PolyPhred identifies the putative polymorphic sites, it updates the ACE and PHD files by adding tags that mark the positions of the sites. The tagged sites can then be examined using the program Consed. PolyPhred also generates a detailed output that lists the positions, genotypes and scores of the polymorphic sites in a format that can be easily parsed into a database program.

Part of the process of computing the score involves averaging certain values across the reads in the assembly. For small assemblies, the accuracy of these averages increases with the number of overlapping sequences. This in turn increases the reliability of the results. We recommend that the region of interest be covered by at least eight independent sequences, if possible.

A significant increase in the rate of true-positive SNPs can be achieved by sequencing each sample in both directions. PolyPhred combines double-stranded information to enhance the accuracy of its genotype calls. To take advantage of this feature, it is necessary to use a sensible naming convention when naming the sequence data files. The sequence file names should contain a contiguous set of characters that identify the individual source. Using the -source flag (see below), PolyPhred can then match sequences that are from the same source (see also Reducing the false-positive rate).

The flags

Many of the flags have an abbreviated form, which is shown in parentheses. Most of the flags take an argument, which is shown in square brackets ([ ]). For some flags, the argument is optional. In these cases, the description lists a default argument, which is the value used when the flag is given without an argument.

All of the flags are optional. Each description indicates the argument value or action taken if the flag is omitted.

-ace (-a) [ace file]
Use this flag to specify the ACE file to be read by PolyPhred.
• If omitted: the most recent ACE file is used.

-block [list of block names]
Use this flag to include or exclude blocks from the output file. The valid block names are POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED, MICROSATELLITE, SAMPLE and COVERAGE. To include a block, precede the block name with a plus sign (+). To exclude a block, precede the block name with a minus sign (-). For example, to exclude the SAMPLE and COVERAGE blocks from the output report, add this to the command line:

  -block -SAMPLE -COVERAGE

• If omitted: all blocks except MICROSATELLITE are included in the output.

-clear
Use this flag to remove all PolyPhred tags from the ACE and PHD files.
• If omitted: normal operation

-dir (-d) [work directory]
Use this flag to specify the location of the data. The flag allows PolyPhred to be run from a directory other than the one containing the data to be analyzed (see Running PolyPhred).
• If omitted: PolyPhred must be run either from the edit_dir directory or from the data directory of the data to be analyzed.

-flanking (-f) [number]
Use this flag to specify the number of bases flanking the polymorphic sites reported in the POLY and POLYINDEL blocks of the PolyPhred output.
• Accepted numbers: 0 - 50
• If omitted: 10

-group (-g) [regular expression]
This flag specifies a subset of the files to be used in the analysis. PolyPhred analyzes only those sequences with a name that matches the regular expression.
• If omitted: .+ (i.e., all files)
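
As a rough illustration of how a -group expression narrows the set of reads, the sketch below filters a list of hypothetical chromat file names in Python. It assumes that the expression may match anywhere in the name (search semantics) and that Python's regular expression syntax is close enough to the one PolyPhred accepts; neither point is confirmed by this document.

```python
import re

def select_reads(names, pattern=".+"):
    """Return the names that match the pattern.

    Assumption: a match anywhere in the name counts (re.search).
    """
    regex = re.compile(pattern)
    return [name for name in names if regex.search(name)]

# Hypothetical chromat file names, for illustration only.
reads = ["geneA-NA001-F.scf", "geneA-NA001-R.scf", "geneB-NA002-F.scf"]

all_reads = select_reads(reads)              # the default .+ keeps every read
gene_a_reads = select_reads(reads, "^geneA")  # only names starting with geneA
print(all_reads)
print(gene_a_reads)
```

Testing an expression this way before passing it to -group can help confirm that it selects the intended reads.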

-help (-h)
Use this flag to see information on how to use PolyPhred. The flags are listed along with their allowed and default values.
• If omitted: normal operation

-indel (-i) [on / off]
Use this flag to switch on or off the search for indel polymorphisms. See Detection of insertion/deletion polymorphisms.
• Default argument: on
• If omitted: off

-ms [x / on / off]
Use this flag to switch on or off the marking of simple microsatellite repeats. If the argument 'x' is passed, putative SNP sites that are found within microsatellites are given a score equal to the score limit (see -score).
• Default argument: on
• If omitted: off

-nav (-n) [file name / on / off]
Use this flag to generate a navigation file listing the polymorphic sites. If the file name is given but does not have a final ".nav" extension, PolyPhred adds one. The file is written to the edit_dir directory of the working directory.
• Default argument: on, using the file name "polyphred.nav"
• If omitted: off

To use the navigation file in Consed, click on 'Navigate', located at the top of the 'Consed Main Window'. Then click on 'Custom Navigation'. The window that appears should contain the name of the navigation file. Click on the file name to bring up the navigation window.

-output (-o) [file name / on / off]
Use this flag to send the PolyPhred output either to a file or to the standard output (the screen). If the argument is "off", the output is written to the screen. In this case, the output can be redirected to a file using '>'.
• Default argument: on, using the file name "polyphred.out"
• If omitted: off

-quality (-q) [value]
Use this flag to set the quality threshold. PolyPhred uses the quality threshold to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed). Reducing this value results in less trimming of the ends. See Reducing the false-positive rate.
• Accepted value: 0 - 50
• If omitted: 25

-rank (-r) [value / on / off]
Use this flag to direct PolyPhred to score sites with the six-point ranking system. To set the rank threshold, follow the flag with a number from 1 to 6. PolyPhred marks and reports only sites that are assigned a rank between 1 and the rank threshold, inclusive. See Reducing the false-positive rate.
• Accepted value: 1 - 6
• Default argument: on, using the value 3
• If omitted: the 100-point scoring system is used.

-ref [reference sequence identifier / on / off]
Use this flag to specify a reference sequence for reporting of polymorphic site positions. In this case, PolyPhred uses the consensus sequence as the standard, rather than the reference sequence (see -refcomp below). See Using a reference sequence.
• Default argument: on, using the identifier ".REF"
• If omitted: off

-refcomp [reference sequence identifier / on / off]
Use this flag to direct PolyPhred to use a reference sequence as the standard rather than the consensus sequence. See Using a reference sequence.
• Default argument: on, using the identifier ".REF"
• If omitted: off

-source (-s) [/delimiter / posn1 posn2 / off]
Use this flag to activate the source genotype resolution function and set the location in the chromat file names of the source identifier, or turn the function off. The source identifier is a contiguous set of characters that uniquely identifies the source of the DNA sample. PolyPhred uses the source identifier to match sequences from the same DNA sample. See Reducing the false-positive rate.

The source identifier can be placed in the chromat file names using either of two methods. One method is to flank the identifier characters with a delimiter. Any valid file name character can serve as the delimiter. When running PolyPhred, indicate the delimiter as follows ('c' is the delimiter character):

  polyphred -s /c

For example, if the chromat file names are of the form: abc-source-xyz.scf
run PolyPhred as:

  polyphred -s /-

The second method for locating the source identifier is to place the identifier characters in a constant location in all chromat file names. Indicate the location of the identifier characters as follows:

  polyphred -s posn1 posn2

For example, if all chromat file names are of the form: abcSOURCExyz.scf
where SOURCE is the location of the identifier characters, from positions 4 to 9, then run PolyPhred as follows:

  polyphred -s 4 9

If the function has been activated in the .polyphredrc file, it can be turned off with the 'off' argument.
• If omitted: off
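
The two ways of locating the source identifier can be mimicked in a few lines of Python, which may help when checking a naming convention before a run. This is only a sketch: the function names are hypothetical, the delimiter method is assumed to take the text between the first two occurrences of the delimiter, and positions are assumed to be 1-based and inclusive, as the examples above suggest.

```python
from collections import defaultdict

def source_by_delimiter(name, delimiter):
    """Return the text between the first two delimiters (assumed rule)."""
    parts = name.split(delimiter)
    if len(parts) < 3:
        raise ValueError(f"{name!r} does not contain {delimiter!r} twice")
    return parts[1]

def source_by_position(name, posn1, posn2):
    """Return characters posn1..posn2 of the name (1-based, inclusive)."""
    if not 1 <= posn1 <= posn2 <= len(name):
        raise ValueError("positions fall outside the file name")
    return name[posn1 - 1:posn2]

def group_by_source(names, delimiter):
    """Group file names that share a source identifier."""
    groups = defaultdict(list)
    for name in names:
        groups[source_by_delimiter(name, delimiter)].append(name)
    return dict(groups)

print(source_by_delimiter("abc-source-xyz.scf", "-"))  # source
print(source_by_position("abcSOURCExyz.scf", 4, 9))    # SOURCE

# Hypothetical forward and reverse reads from two samples.
reads = ["ex1-NA001-F.scf", "ex1-NA001-R.scf", "ex1-NA002-F.scf"]
print(group_by_source(reads, "-"))
```

A sample that ends up in a group of one has no matching read, so its genotype calls would not benefit from the double-stranded comparison.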

-score [number]
Use this flag to select the 100-point scoring system and set the score threshold. PolyPhred marks and reports only sites that are assigned a score between 99 and the score threshold, inclusive. See Reducing the false-positive rate.
• Accepted numbers: 0 - 99
• If omitted or argument omitted: the 100-point scoring system is used with a score threshold of 70

-snp [het / hom / on / off]
Use this flag to enable or disable SNP detection, or to mark only heterozygous (het) or only homozygous (hom) polymorphisms.
• Default argument: on, marking both heterozygous and homozygous polymorphisms
• If omitted: on

-tag (-t) [tag type]
Use this flag to specify the tag type to use for viewing SNP sites in Consed. The three tag types are "genotype", "polymorphism", and "rank". The tag types can be abbreviated as g, p and r, respectively. Using the genotype tag results in putative polymorphic sites marked on the consensus sequence with color-coded tags indicating rank, and putative SNPs marked with pink tags on the sample sequences. Using the rank tag results in color-coded tags indicating rank placed on both the consensus sequence and the sample sequences (see How PolyPhred scores SNP sites for the color codes.) Using the polymorphism tag results in a blue tag placed on all putative polymorphic sites on the consensus sequence and pink tags indicating putative SNPs on the sample sequences.
• If omitted: genotype

-update [on / off]
Use this flag to enable or disable updating of the ACE and PHD files. If updating is switched off, the ACE and PHD files are not updated, and the PolyPhred results cannot be viewed in Consed.
• Default argument: on
• If omitted: on

-verbosity (-v) [0 / 1 / 2]
Use this flag to set the amount of progress reporting that is written to the screen while PolyPhred is running. The allowed arguments range from 0 (least reporting) to 2 (most reporting).
• If omitted: 0

-version
Use this flag to view the PolyPhred version and build number.
• If omitted: normal operation

-window (-w) [number]
Use this flag to set the window width. PolyPhred uses the window width, together with the quality threshold, to determine the extent of the excluded, or trimmed, regions at the ends of the sample sequences (the regions shaded in yellow when the assembly is viewed in Consed).
• Accepted numbers: 5 - 50
• If omitted: 20

-xml [on / off]
Use this flag to specify the format of the PolyPhred output.
• Default argument: on -> XML output
• If omitted: off -> Default output format

How PolyPhred scores SNP sites

A SNP site generally appears in the sequence traces as two overlapping peaks with reduced peak heights. Ideally, the areas under these two peaks are nearly the same, and the heights of the peaks are reduced by about a half of what the height of a hypothetical homozygous peak would be at the same position.
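
The ideal pattern can be made concrete with a toy check in Python. This is not PolyPhred's scoring method, which this document does not specify; the function name, the inputs and both tolerances are arbitrary assumptions chosen only to illustrate the two criteria.

```python
def looks_heterozygous(area_a, area_b, height, hom_height,
                       area_tolerance=0.3, drop_tolerance=0.2):
    """Toy test of the ideal heterozygote pattern (tolerances are arbitrary)."""
    if max(area_a, area_b) <= 0 or hom_height <= 0:
        return False
    # Criterion 1: the two overlapping peaks have nearly equal areas.
    area_ratio = min(area_a, area_b) / max(area_a, area_b)
    similar_areas = area_ratio >= 1.0 - area_tolerance
    # Criterion 2: the peak height is about half of the homozygous height.
    relative_height = height / hom_height
    halved = abs(relative_height - 0.5) <= drop_tolerance
    return similar_areas and halved

# Hypothetical peak measurements in arbitrary units.
balanced = looks_heterozygous(480, 520, 510, 1000)
single_peak = looks_heterozygous(950, 60, 980, 1000)
print(balanced)     # True
print(single_peak)  # False
```

A real trace rarely matches the ideal exactly, which is why PolyPhred reports a graded score rather than a yes or no answer.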

When PolyPhred identifies a putative heterozygous site in a sample sequence, it assigns the site a score that indicates how well the traces of the two peaks fit the ideal pattern for a SNP. The score values range from 99 to 0, where 99 indicates a very good fit.

If a site is determined to be homozygous, PolyPhred compares its genotype with that of a standard sequence, which can be either the consensus sequence or a user-specified reference sequence. If the genotypes do not match, the site is marked as a minor or alternative allele.

If the -source flag is used, PolyPhred combines the information in matched reads to increase the accuracy of its genotype calls. Scores for genotypes that are in agreement are increased (see Reducing the false-positive rate).

When all sites at a position (i.e., a column as viewed in Consed) have been assigned a score, PolyPhred calculates an overall score and genotype for the position. This score depends on the highest-scoring site in the sample sequences. If the overall score is greater than or equal to the score threshold (see the -score flag), then PolyPhred marks the position as polymorphic. The number of sites that PolyPhred marks can be controlled by adjusting the score threshold (see Reducing the false-positive rate).

If the six-point ranking system is selected, PolyPhred converts the score to a rank according to the table below. Along with each rank is the color of the tags as displayed in Consed.

The 'True Positive Rate' column shows the percentage of true positive SNPs marked within each rank, as found in our own analysis, using the default -score and -quality settings. These results may vary depending on changes in these settings, as well as the quality of the data and number of samples analyzed.

  Score    Rank   Tag Color    True Positive Rate
  99       1      red          97%
  95-98    2      orange       75%
  90-94    3      green        62%
  70-89    4      dark blue    35%
  50-69    5      magenta      11%
  0-49     6      purple       1%
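
For scripts that post-process PolyPhred scores, the table can be encoded directly. The sketch below is a plain lookup built from the score ranges listed above; the function and variable names are hypothetical.

```python
# (lowest score, highest score, rank, tag color), copied from the table.
RANKS = [
    (99, 99, 1, "red"),
    (95, 98, 2, "orange"),
    (90, 94, 3, "green"),
    (70, 89, 4, "dark blue"),
    (50, 69, 5, "magenta"),
    (0, 49, 6, "purple"),
]

def score_to_rank(score):
    """Return (rank, tag color) for a score from 0 to 99."""
    for low, high, rank, color in RANKS:
        if low <= score <= high:
            return rank, color
    raise ValueError("score must be between 0 and 99")

print(score_to_rank(99))  # (1, 'red')
print(score_to_rank(72))  # (4, 'dark blue')
```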

The output report

To facilitate parsing of the output file, the report is divided into several blocks. Each block begins with the token BEGIN_BLOCKNAME and ends with END_BLOCKNAME, where BLOCKNAME is the name of the block.

The output report begins with the line BEGIN_MESSAGE and ends with the line END_MESSAGE. The first block within the report is the HEADER block. This block provides the version of PolyPhred that generated the output report, a thumbprint to uniquely identify the output, the date and time the output was generated, and the directory from which PolyPhred was run.

Next is the COMMAND_LINE block. Listed in this block are the user-definable parameters that the user needs to interpret the output report, and to repeat the analysis if needed. This includes the working directory and the ACE file that was used, and those parameters that affect the analysis.

The rest of the report contains results for one or more contigs. The results for each contig are enclosed within the lines BEGIN_CONTIG and END_CONTIG. The line immediately following the BEGIN_CONTIG token provides the name of the contig. The results are then subdivided into several blocks that are described below. The user can specify which blocks appear in the output report by using the -block flag.
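
The block structure lends itself to a simple line-oriented reader. The following Python sketch splits a report into its blocks, keyed by contig name and block name. It assumes that every BEGIN_ and END_ token sits on its own line; the sample report is invented for illustration and does not reproduce real PolyPhred output.

```python
def split_blocks(report_text):
    """Return {(contig, block): [lines]} for each block in a report."""
    blocks = {}
    contig = None               # contig being read, or None outside a contig
    expect_contig_name = False  # True right after BEGIN_CONTIG
    current = None              # name of the open block, if any
    for raw in report_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if expect_contig_name:
            contig = line
            expect_contig_name = False
        elif line == "BEGIN_CONTIG":
            expect_contig_name = True
        elif line == "END_CONTIG":
            contig = None
        elif line in ("BEGIN_MESSAGE", "END_MESSAGE"):
            continue
        elif line.startswith("BEGIN_"):
            current = line[len("BEGIN_"):]
            blocks[(contig, current)] = []
        elif line.startswith("END_"):
            current = None
        elif current is not None:
            blocks[(contig, current)].append(line)
    return blocks

# Invented sample; the line contents are placeholders.
sample = """\
BEGIN_MESSAGE
BEGIN_HEADER
version 5.0
END_HEADER
BEGIN_CONTIG
Contig1
BEGIN_COVERAGE
1 120 8
END_COVERAGE
END_CONTIG
END_MESSAGE
"""

blocks = split_blocks(sample)
print(blocks)
```

Blocks outside any contig, such as HEADER, are stored with None as the contig name.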

If the -ref flag is used, PolyPhred adds an additional field to the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. The extra field, which comes second after the consensus sequence position, is the position relative to a reference sequence.

The POLY block
In this block, the putative SNP sites identified by PolyPhred are listed, as well as sites marked by columntag type tags (see User-defined manual tags). Each line reports the consensus sequence position, the 5' sequence flanking the polymorphic site, the two most common alleles at the site, the 3' sequence flanking the site, and the over-all score assigned to the site.
• XML tag: block-snp_site subtag: snp_site

The GENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each putative SNP site in the POLY block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score.
• XML tag: block-snp_genotype subtag: snp_genotype
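
A line from the GENOTYPE block can be turned into a record along the following lines. The field order follows the description above, but the layout is an assumption: fields are taken to be separated by whitespace, with the two alleles as separate fields, and the sample lines are invented. When -ref is used, the reference position is read from the second field.

```python
from typing import NamedTuple, Optional, Tuple

class Genotype(NamedTuple):
    consensus_pos: int
    ref_pos: Optional[int]   # present only when -ref was used
    sample_pos: int
    sample: str
    alleles: Tuple[str, str]
    score: int

def parse_genotype_line(line, has_ref=False):
    """Parse one GENOTYPE line (assumed whitespace-separated fields)."""
    fields = line.split()
    expected = 7 if has_ref else 6
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields in {line!r}")
    ref_pos = int(fields.pop(1)) if has_ref else None
    cons, samp, name, allele1, allele2, score = fields
    return Genotype(int(cons), ref_pos, int(samp), name,
                    (allele1, allele2), int(score))

# Invented example lines.
plain = parse_genotype_line("343 303 sampleA.scf C T 99")
with_ref = parse_genotype_line("343 1203 303 sampleA.scf C T 99", has_ref=True)
print(plain)
print(with_ref)
```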

The COLUMNGENOTYPE block
In this block, the genotypes of the individual sample sequences are listed for each manual-SNP tag applied to the consensus sequence. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, the two alleles at the position, and the score. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
• XML tag: block-manual_snp subtag: snp_genotype

The INDEL block
If the -indel flag is set to 'on', the putative indel sites are listed in this block. Each line reports the consensus sequence position, the position relative to the sample sequence in which the indel was found, the name of the sample sequence, the genotype ('+-' indicates a heterozygote, '--' indicates a homozygous deletion), and the length of the indel.
• XML tag: block-marked_indel subtag: marked_indel

The POLYINDEL block
In this block, the manual-indel tag sites applied to the consensus sequence are listed. Each line reports the consensus sequence position, the 5' sequence flanking the indel site, the segment involved in the indel, the 3' sequence flanking the site, and the comment if one is present. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
• XML tag: block-indel_site subtag: indel_site
• This is a new block.

The COLUMNINDEL block
In this block, the genotypes of the individual sample sequences are listed for each manual-indel tag listed in the POLYINDEL block. Each line reports the consensus sequence position, the position relative to the sample sequence, the name of the sample sequence, and the genotype. The tag used to specify the genotype can be user-defined in the .polyphredrc file (see User-defined manual tags).
• XML tag: block-manual_indel subtag: manual_indel
• This is a new block.

The MANUALGENOTYPE block
In this block, sample sequence sites that have been tagged manually are listed. Each line reports the consensus sequence position of a tagged site, the position relative to the sample sequence that was tagged, the identity of the tag, and the comment if one is present.
PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
• XML tag: block-manual_genotype subtag: manual_genotype

The VERIFIED block
In this block, sites manually tagged as verified are listed. Each line reports the consensus sequence position and the tag identity. PolyPhred obtains the user-defined tags from the .polyphredrc file (see User-defined manual tags).
• XML tag: block-verified_site subtag: verified_site

The MICROSATELLITE block
If the -ms flag is set to 'on', this block lists the microsatellite sequences that were found. Each line reports the consensus sequence position of the 5' end of the microsatellite and the repeat pattern.
• XML tag: block-microsatellite subtag: microsatellite
• This is a new block.

The SAMPLE block
The names of the sample sequences that were analyzed and their sequence qualities are listed in this block. Each line reports the name of a sequence, the positions of the left and right boundaries of the search region (between the trimmed ends), and the average site quality, as determined by Phred, within the search region.
• XML tag: block-sample_quality subtag: sample_quality

The COVERAGE block
This block provides a tally of the number of sample sequences that PolyPhred examined at each position. Each line reports the begin and end positions of a range relative to the consensus sequence, followed by the number of sample sequences that were analyzed in the range.
• XML tag: block-coverage subtag: coverage
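
The COVERAGE block can be used to flag regions with few reads. The Python sketch below lists ranges covered by fewer than eight sample sequences, the number recommended earlier in this document; using that recommendation as the cutoff is our choice for the example, not a rule enforced by PolyPhred. It assumes three whitespace-separated integers per line, and the sample lines are invented.

```python
RECOMMENDED_READS = 8  # from the recommendation of at least eight sequences

def low_coverage_ranges(lines, minimum=RECOMMENDED_READS):
    """Return (begin, end, count) for ranges with fewer reads than minimum."""
    low = []
    for line in lines:
        if not line.strip():
            continue
        # Assumed layout: begin position, end position, number of reads.
        begin, end, count = (int(field) for field in line.split())
        if count < minimum:
            low.append((begin, end, count))
    return low

# Invented COVERAGE lines.
coverage = ["1 40 3", "41 310 12", "311 350 6"]
thin = low_coverage_ranges(coverage)
print(thin)  # [(1, 40, 3), (311, 350, 6)]
```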

Detection of insertion/deletion polymorphisms

The indel detection feature can be enabled with the -indel flag. PolyPhred identifies sample sequences with putative heterozygous indels, as well as sequences with a deletion greater than two bases relative to the consensus sequence.

When PolyPhred identifies an indel site, it marks it on the consensus sequence with an 'indelSite' tag. Sample sequences containing the indel are marked with a 'heterozygoteIndel' tag, while those that do not are marked with a 'homozygoteIndel' tag.

PolyPhred is sometimes inaccurate in determining the positions and lengths of indels. Therefore, a manual tagging system is provided for marking the correct positions and lengths of indels. The corrected positions and lengths will be reported in the PolyPhred output (see User-defined manual tags).

Versions of Consed prior to version 13.0 are not able to interpret the indel tags. To solve this problem, it is necessary to modify the .consedrc file. Add the following lines to the .consedrc file:

  consed.customConsensusTag1: indelSite
  consed.tagColorCustomConsensusTag1: DarkCyan
  consed.customTag1: indel
  consed.tagColorCustomTag1: DarkOrange

If the 'customConsensusTag1' and 'customTag1' tags are already used, change the final number 1 in the tag names to the next available number.

User-defined manual tags

One of the features available in the Consed program is the ability to create custom tags. These tags can be used to mark or highlight specific sites or regions on the consensus sequence or on individual sample sequences. For example, following analysis by PolyPhred, the user can manually mark putative SNP sites as verified, or change an incorrect genotype. To create custom tags, the user needs to define the tags in the .consedrc file (see the Consed documentation under the Help menu).

PolyPhred can be set to recognize four types of custom tags, and take the appropriate action when they are encountered. This provides a way for the user to pass information from Consed to the PolyPhred output file. For example, PolyPhred can be set to recognize a custom "verified" tag and report sites marked with this tag type in the VERIFIED block of the output file. In addition, two of the custom tag types, columntag and columnindeltag, can be used to force PolyPhred to report genotypes for all sample sequences at specified positions.

For PolyPhred to recognize the tags, they must be listed in the .polyphredrc file (see Customizing PolyPhred). Once the .polyphredrc file has been set up, the typical procedure is to 1) assemble the data, 2) run PolyPhred, 3) use Consed to analyze the results, mark sites and make changes, and 4) run PolyPhred again to obtain both the PolyPhred- and user-generated information in the output file.

The four tag types are:

manualtag
Tags of this type are used to mark or edit a site in a sample sequence. Typically these tags are used to change the genotype call made by Phred or PolyPhred. Sites marked with these tags are listed in the MANUALGENOTYPE block.

verifiedtag
This tag type is applied to the consensus sequence to indicate that a polymorphic site is verified. Sites marked with these tags are listed in the VERIFIED block.

columntag
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide SNP genotypes for all of the sample sequences at the tagged sites. Sites marked by these tags are listed in the POLY block. The genotypes in the sample sequences are listed in the COLUMNGENOTYPE block.

columnindeltag
Tags of this type are applied to the consensus sequence and are used to force PolyPhred to provide indel genotypes for all of the sample sequences at the tagged sites. The tags can be used to mark the positions and define the length of indel sites. The tag should "cover" the segment involved in the indel so that PolyPhred can report the indel segment in the output. Sites marked by these tags are listed in the POLYINDEL block, and the genotypes in the sample sequences are listed in the COLUMNINDEL block. The name of the tag that marks the site will be used to indicate the homozygous genotype. The heterozygous genotype can be set in the .polyphredrc file with the 'indelhettag' key-word. If this is not set, PolyPhred will indicate heterozygotes with the label 'heterozygoteIndel'.

Installing PolyPhred

Make sure the following programs are installed:

  phred              version 0.961028 or later
  phrap              version 0.960731 or later
  phd2fasta          version 0.971024 or later
  consed             version 13.0 or later

Download the PolyPhred package for the appropriate platform. Put the file in a directory where it is to be unpacked.

Run "tar xzf polyphred.tar.gz". This should produce the following files and directories:

  polyphred          the PolyPhred program
  polygen            tool for making PHD and POLY files from ABI chromat files.
  sudophred          tool for making chromat, PHD and POLY files from FASTA files
  polyphred.html     this document
  phredPhrap         perl script for running phred and phrap together in the correct order.

Move or copy the polyphred, sudophred and phredPhrap files to the directory from which they will be run, such as: /usr/local/genome/bin/.
If you already have a copy of phredPhrap and wish to keep it, you must open the phredPhrap file and edit it as follows:
1. Uncomment the following line by removing the '#':
  # $polyPhredExe = "/usr/local/genome/bin/polyphred";
  Make sure the path within the quotes matches the directory in the previous step.
2. Change the 0 to 1 in the line
  $bUsingPolyPhred = 0;
3. phredPhrap also contains instructions for running PolyPhred automatically after Phred and Phrap. It is recommended that these lines be deactivated and PolyPhred be run separately. This makes it easier to determine the source when problems occur. To deactivate the lines, remove or comment out the following:
  if ( $bUsingPolyPhred ) {
    print "\n\n--------------------------------------------------------\n";
    print "Now running polyphred for polymorphism detection...\n";
    print "--------------------------------------------------------\n\n\n";
    $szPolyPhredFile = $szBaseName . ".polyphred.out";
    $szPolyPhredFile = $szBaseName . ".fasta.screen.polyphred.out";
    !system( "$polyPhredExe -ace $szAceFileToBeProduced > $szPolyPhredFile" ) || die "some problem running $polyPhredExe $!";
  }

Read the section Customizing PolyPhred, as well as the section Detection of insertion/deletion polymorphisms for instructions on customizing Consed.

Running PolyPhred

PolyPhred reads and modifies data files that are generated by the programs Phred and Phrap, which can then be examined by the program Consed. These programs require the sequence data files to be located in a work directory containing three subdirectories called 'chromat_dir', 'phd_dir' and 'edit_dir'. In addition, PolyPhred needs a fourth subdirectory called 'poly_dir'. It is recommended that a separate working directory be created for each data set. For example, if the data set is called "mydata", a directory called mydata can be created:

  mkdir mydata

Within this directory, create the four subdirectories as follows:

  cd mydata
  mkdir chromat_dir edit_dir phd_dir poly_dir

After these directories have been created, move or copy the chromat files to the chromat_dir directory.
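
When many data sets are processed, the directory layout can be created and checked from a script. The Python sketch below uses only the four subdirectory names given above; the function names are hypothetical, and the example runs in a temporary directory so that it does not touch real data.

```python
from pathlib import Path
import tempfile

SUBDIRS = ("chromat_dir", "edit_dir", "phd_dir", "poly_dir")

def make_workdir(root):
    """Create the work directory and its four subdirectories."""
    root = Path(root)
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root

def missing_subdirs(root):
    """Return the names of required subdirectories that do not exist."""
    root = Path(root)
    return [name for name in SUBDIRS if not (root / name).is_dir()]

with tempfile.TemporaryDirectory() as tmp:
    work = make_workdir(Path(tmp) / "mydata")
    complete = missing_subdirs(work)      # nothing missing after creation
    (work / "poly_dir").rmdir()           # simulate a forgotten poly_dir
    incomplete = missing_subdirs(work)

print(complete)    # []
print(incomplete)  # ['poly_dir']
```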

If a reference sequence is to be included in the assembly, use the sudophred tool to generate fake chromat, PHD and POLY files.

To assemble the data, change to the edit_dir directory and run "phredPhrap mydata". The program phredPhrap automatically runs the programs Phred and Phrap consecutively. When the process is complete, there should be several files in the edit_dir, including one with the extension .ace.1 (the ACE file), and several files in the phd_dir and poly_dir directories.

View the assembled sequences in Consed. Further assembly of the data might be required. For information on this process, check the Consed documentation.

Run "polyphred". Include any desired flags on the command line.
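
For repeated runs it can be convenient to assemble the command line in a script. The Python sketch below builds an argument list from a few of the flags described in this document (-dir, -source, -score, -output and -block). The helper names and the example values are hypothetical, and the order of the flags on the command line is assumed not to matter.

```python
import subprocess

def polyphred_command(workdir, source=None, score=None, output=None,
                      exclude_blocks=()):
    """Build a polyphred argument list from documented flags."""
    cmd = ["polyphred", "-dir", str(workdir)]
    if source is not None:
        cmd += ["-source", *source]       # ("/-",) or ("4", "9")
    if score is not None:
        if not 0 <= score <= 99:
            raise ValueError("-score accepts numbers from 0 to 99")
        cmd += ["-score", str(score)]
    if output is not None:
        cmd += ["-output", output]
    if exclude_blocks:
        # A minus sign before a block name excludes that block.
        cmd += ["-block", *("-" + name for name in exclude_blocks)]
    return cmd

def run_polyphred(cmd):
    """Run the command; requires polyphred to be installed and on the PATH."""
    return subprocess.run(cmd, check=True)

cmd = polyphred_command("mydata", source=("/-",), score=80,
                        output="polyphred.out",
                        exclude_blocks=("SAMPLE", "COVERAGE"))
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with characters such as the delimiter given to -source.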

Use Consed to view the polymorphic sites tagged by PolyPhred (see Customizing PolyPhred).

Reducing the false-positive rate

There are three ways to adjust the rate of false-positive calls made by PolyPhred. The best method is to use the source genotype resolution function (the -source flag). This method achieves a large reduction in false positives while minimizing the loss of true sites. To use this feature, there should be double-stranded coverage (sequencing in both directions) for most or all of the samples. Each sequence (chromat) file name should contain a string of contiguous characters that uniquely identifies the sample. The location of the identifier is then passed to PolyPhred using the -source flag. PolyPhred can then match reads from the same source. When the genotype calls for two matched reads are in agreement, the resulting score is increased. If the genotypes disagree, PolyPhred chooses the genotype with the greater likelihood of being correct.

The most direct method is by using the -score flag to set the score threshold. Only sites that receive a score above this threshold are called, so increasing the threshold results in fewer calls.

For those using the six-point ranking system, increasing the rank threshold means setting this value to 2 or 1. This will have the same effect as increasing the score threshold to 95 or 99, respectively.

In general, the false-positive SNP call rate tends to increase near the trimmed regions at the ends of a sequence. Therefore, trimming more of the ends will tend to reduce the number of false-positive calls. The length of the trimming is increased by raising the quality threshold, which is set with the -quality flag.

For all of these methods, reducing the number of false-positive calls will also result in an increase in the number of real SNPs that are missed (false negatives). Generally, as one reduces the false-positive rate, the number of false positives that are eliminated is much greater than the number of missed real SNPs. Also, the first real sites that are missed are the rare SNPs, that is, sites with only one or two heterozygotes present in the data set.

Using the polygen tool

The polygen program is a tool that creates PHD and POLY files using the base calls and quality scores generated by the ABI base-calling software. This method is an alternative to using the Phred base-calling program.

Polygen can be run from either the edit_dir directory or the directory above it (the work directory). To run the program, enter:

  polygen

It can also be run from any directory by using the -dir (or -d) flag to specify the work directory where the data is located, similar to the -dir flag for polyphred. For example:

  polygen -d ~/my_home_dir/gene_data

The program looks in the chromat_dir directory for the chromat files. It creates a PHD and POLY file for each chromat file that lacks a PHD file. The PHD files are written into the phd_dir directory, and the POLY files are written into the poly_dir directory.

Alternatively, the -list (-l) flag can be used to specify a file containing a list of chromat files. Polygen creates PHD and POLY files from these chromat files instead.

To force polygen to overwrite any existing PHD and POLY files, use the -overwrite (-o) flag.

Run "polygen -h" or "polygen -help" to show a list of the options.

Run "polygen -v" or "polygen -version" to show the version.

Using the sudophred tool

The sudophred program is a tool that can be used to generate fake chromat, PHD and POLY files from DNA sequences in FASTA format. Fake chromat and PHD files are needed if one wishes to include a reference sequence in the assembly of the data set (see Using a reference sequence). Also, if one wants to compare data from sequence trace (chromat) files with text sequences, the text sequences need to be converted into all three file types.

The sudophred program takes one text file as input. The text file can contain one or more sequences in FASTA format. If one is generating fake data files for a reference sequence, sudophred writes data files for the first sequence only. Otherwise, sudophred will generate data files for each of the sequences in the text file. In either case, the names of the data files are taken from the string that follows the '>' at the beginning of each sequence.
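
Before running sudophred, it can help to check which names the data files will be based on. The helper below is a hypothetical sketch, not part of the PolyPhred package. It assumes that the name is the text after '>' up to the first blank, which the description above does not state explicitly.

```shell
# fasta_names FILE: print the name of each sequence in a FASTA file.
fasta_names() {
    # A header line starts with '>'; keep the text up to the first blank.
    sed -n 's/^>[[:space:]]*\([^[:space:]]*\).*/\1/p' "$1"
}

# Example with a small two-sequence file (made-up sequences).
cat > example.fasta <<'EOF'
>geneA_ref
ACGTACGTAC
>geneB
GGCCTTAAGG
EOF

fasta_names example.fasta               # names of all sequences in the file
fasta_names example.fasta | head -n 1   # the first sequence, the only one used for a reference
```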

One way to run sudophred is to put the FASTA file in an edit_dir directory. Sudophred will write each file that it generates into the appropriate directory. That is, sudophred writes the chromat file in the chromat_dir directory, the phd file in phd_dir, and the poly file in poly_dir. One can also put the FASTA file in an arbitrary directory and run sudophred from there. In this case, sudophred will write all of the files into that same directory. The files must then be moved to the appropriate data subdirectories. In either case, it is easiest to generate the fake data files before running the phredPhrap program that assembles the data into contigs.

By default, sudophred writes all three files. The chromat files are written in SCF format. In the phd files, the quality values are all 59.

To run sudophred, enter:

  sudophred [filename]

where filename is the name of the text file containing the sequences. The file name must always be the first argument.

To use sudophred to generate reference sequence files, use the -r flag. This flag can be followed by a string that sudophred will append to the reference sequence file name. For example:

  sudophred [filename] -r .XYZ

If no string is supplied, sudophred will use the default string ".REF".

To change the quality threshold, use the -q flag followed by the value (an integer from 0 to 59). For example:

  sudophred [filename] -q 20

To write the chromat files in ABI format, use the -abi flag.

  sudophred [filename] -abi

Run "sudophred -h" or "sudophred -help" to show a list of the options.

Run "sudophred -v" or "sudophred -version" to show the version.

Using a reference sequence

For the purpose of locating SNPs and other features on a standard sequence map, it is useful to include the standard, or reference, sequence in the data assembly. One can then run PolyPhred with the -ref flag to obtain the SNP positions relative to that reference sequence. To have PolyPhred compare the sample sequences with the reference sequence rather than with the consensus sequence that is generated by Phrap, run PolyPhred with the -refcomp flag.

When either the -ref or -refcomp flag is used, PolyPhred reports two positions in the output file rather than one. The blocks displaying this alternate format are the POLY, GENOTYPE, COLUMNGENOTYPE, INDEL, POLYINDEL, COLUMNINDEL, MANUALGENOTYPE, VERIFIED and MICROSATELLITE blocks. In each block, the first number is the position of the feature relative to the consensus sequence, and the second is the position relative to the reference sequence.

To include a reference sequence in the assembly, first create the necessary data files from the reference sequence using the sudophred program supplied with PolyPhred (see Using the sudophred tool).

Use sudophred with the -r flag to generate the reference sequence data files. For example,

  sudophred [filename] -r

where filename is the name of the text file containing the reference sequence in FASTA format. The data files will be given names that begin with the string that follows the '>' at the beginning of the sequence, followed by the default reference identifier ".REF". In this case, one would run PolyPhred with the reference option as follows:

  polyphred -ref

To specify a different reference identifier, follow the -r flag with the identifier string. For example, to set the reference identifier as "xYZ", run:

  sudophred [filename] -r xYZ

In this case, the data files will contain the string "xYZ" in the file names, rather than ".REF", and it will be necessary to select the reference option as follows:

  polyphred -ref xYZ
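
The reference-sequence steps can be strung together as in the sketch below. The commands and flags are the ones shown in this documentation; the names "mydata" and "reference.fasta" are placeholders, and the run_if_present helper is a hypothetical wrapper that only echoes a command when the tool is not installed. The sketch assumes that the FASTA file has already been placed in edit_dir and that polyphred is run from that directory.

```shell
# Placeholder names: adjust to the data set at hand.
workdir="mydata"
ref_fasta="reference.fasta"
ref_id=".REF"

# run_if_present CMD ARGS...: echo the command, run it only if installed.
run_if_present() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "  (skipped: $1 not found on PATH)"
    fi
}

mkdir -p "$workdir/chromat_dir" "$workdir/edit_dir" "$workdir/phd_dir" "$workdir/poly_dir"
(
    cd "$workdir/edit_dir" || exit 1
    run_if_present sudophred "$ref_fasta" -r "$ref_id"   # fake chromat, PHD and POLY files
    run_if_present phredPhrap "$workdir"                 # assemble with Phred and Phrap
    run_if_present polyphred -ref "$ref_id"              # positions relative to the reference
)
```

The identifier passed to polyphred -ref has to match the one given to sudophred -r, so keeping it in a single variable avoids a mismatch.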

Customizing PolyPhred

PolyPhred can be customized by creating a .polyphredrc file. The .polyphredrc file allows the user to change default parameter values, as well as specify any manual tags that PolyPhred should capture and write to the output report. This file is optional, and if it is not present, PolyPhred will use its built-in default parameters and will not capture manual tags.

When PolyPhred starts, it looks for a .polyphredrc file in three locations. It first looks in the user's current directory. If the file is not found there, PolyPhred looks in the user's home directory. If the file is still not found, PolyPhred looks in the directory named by the POLYPHRED_PATH environment variable. This variable is usually set in the user's shell rc file. For example, in the C Shell, the directory is specified by including in the .cshrc file the line:

  setenv POLYPHRED_PATH [path]

where [path] is the directory containing the .polyphredrc file.
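
To see which .polyphredrc file would be picked up, the search order described above (current directory, home directory, POLYPHRED_PATH) can be imitated in the shell. The function below is an illustrative sketch with a made-up name. It mimics the documented order and is not PolyPhred's own lookup code.

```shell
# In a Bourne-type shell, the equivalent of the setenv line is:
#   POLYPHRED_PATH=[path]; export POLYPHRED_PATH

# find_polyphredrc: print the first .polyphredrc found in the documented order.
find_polyphredrc() {
    for dir in "." "$HOME" "${POLYPHRED_PATH:-}"; do
        if [ -n "$dir" ] && [ -f "$dir/.polyphredrc" ]; then
            echo "$dir/.polyphredrc"
            return 0
        fi
    done
    return 1
}

find_polyphredrc || echo "no .polyphredrc found; built-in defaults apply"
```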

Each line in the .polyphredrc file can be a blank line, a comment line beginning with a '#' character, or a line beginning with one of the following key-words:

flag
The 'flag' key-word can be used with any of the command-line flags to change a default value. For example, to change the default score threshold to 80 and the quality threshold to 30, add these lines to the .polyphredrc file:

  flag -score 80
  flag -q 30

outputfile
The following line

  flag -output out.txt

changes two defaults; it will set the name of the output file to 'out.txt' and cause PolyPhred to write the output in a file with that name rather than to the screen. To change the default file name but keep output to the screen as the default activity, use the 'outputfile' key-word, as:

  outputfile out.txt

Then, to use the new default output file name, run "polyphred -o on".

navfile
Similarly, both of the following lines change the default name of the navigation file. The line

  flag -nav [file name]

causes PolyPhred to write a navigation file by default, whereas the line

  navfile [file name]

leaves the default activity off.

refID
All three lines below change the default reference sequence identifier. The first two lines turn their functions on by default, while the third line leaves the default activities off.

  flag -ref [identifier]
  flag -refcomp [identifier]
  refID [identifier]

ranks
The 'ranks' key-word allows the user to change the values used to convert scores to ranks (see How PolyPhred scores SNP sites). For example, the following line:

  ranks 90 80 60 40 20

will result in these conversions:

Probability	Rank
99-90	1
89-80	2
79-60	3
59-40	4
39-20	5
19-0	6
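
The conversion in the table above can be reproduced with a small shell function. This is an illustrative sketch with a hypothetical function name, not code taken from PolyPhred. It assumes that a score equal to a threshold receives the better rank, as the ranges in the table suggest.

```shell
# score_to_rank SCORE T1 T2 T3 T4 T5
# Print the rank (1-6) for SCORE, given the five thresholds from the 'ranks' line.
score_to_rank() {
    score=$1
    shift
    rank=1
    for threshold in "$@"; do
        if [ "$score" -ge "$threshold" ]; then
            echo "$rank"
            return 0
        fi
        rank=$((rank + 1))
    done
    echo "$rank"    # below every threshold: the last rank
}

score_to_rank 85 90 80 60 40 20   # prints 2
score_to_rank 19 90 80 60 40 20   # prints 6
```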

acedir, phddir, polydir
The 'acedir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'acedir' key-word sets the location of the ace file (which is normally in the edit_dir directory). The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. Directories are considered to be within the work directory, unless an absolute path (starting with a '/') is given. Use a single dot (.) to indicate that a directory is the same as the work directory.

date
The 'date' key-word allows the user to set the format of the date that appears at the top of the output file. The key-word must be followed by one of six format codes:

2-digit year	4-digit year	Format
DMY	DMYY	day/month/year
MDY	MDYY	month/day/year
YMD	YYMD	year/month/day

The default is the DMY format.

verifiedtag, columntag, indelhettag, manualtag
Four of the key-words set tag names for the four tag types (see User-defined manual tags). Each tag type can have more than one name (see the example .polyphredrc file below). In addition, the indelhettag key-word allows the user to specify the tag that will be used to indicate heterozygous indels.

Here is an example of a .polyphredrc file:

  date YYMD
  flag -q 25
  flag -f 16

  outputfile report.txt
  refID .refSeq

  # Manual Tags
  verifiedtag    polymorphism
  columntag      manualGenotype
  columnindeltag indel:++
  columnindeltag indel:--
  indelhettag    indel:+-
  manualtag      heterozygote
  manualtag      homozygote
  manualtag      indel
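
A .polyphredrc file can be checked for mistyped key-words before PolyPhred is run. The function below is a hypothetical helper, not part of the PolyPhred package. It accepts only the key-words named in this section and in the example file, and it assumes that leading blanks before a key-word or a '#' are allowed.

```shell
# check_polyphredrc FILE: report lines that do not start with a known key-word.
check_polyphredrc() {
    rc_status=0
    lineno=0
    # read splits off the first word of each line and drops leading blanks.
    while read -r word rest || [ -n "$word" ]; do
        lineno=$((lineno + 1))
        case "$word" in
            ''|'#'*) ;;    # blank line or comment
            flag|outputfile|navfile|refID|ranks|acedir|phddir|polydir|date) ;;
            verifiedtag|columntag|columnindeltag|indelhettag|manualtag) ;;
            *)
                echo "line $lineno: unknown key-word '$word'" >&2
                rc_status=1
                ;;
        esac
    done < "$1"
    return "$rc_status"
}

# Usage: check the file in the current directory, if there is one.
if [ -f .polyphredrc ]; then
    check_polyphredrc .polyphredrc && echo ".polyphredrc looks well-formed"
fi
```

The check looks only at the first word of each line; it does not verify that the values after a key-word are valid.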

Customizing PolyGen

Polygen can be customized in a manner similar to PolyPhred (see Customizing PolyPhred). In this case, a file named .polygenrc contains the preferences. This file can be stored in the user's current directory, in the user's home directory, or in the directory specified by the POLYPHRED_PATH environment variable.

As with PolyPhred, the flag key-word can be used to set any of polygen's flags.

The 'chromatdir', 'phddir' and 'polydir' key-words allow the user to set the locations for the data files to directories other than the ones that are required by Phred, Phrap and Consed. The 'chromatdir' key-word sets the location of the chromat files. The 'phddir' and 'polydir' key-words specify the locations of the phd and poly files, respectively. Directories are considered to be within the work directory, unless an absolute path (starting with a '/') is given. Use a single dot (.) to indicate that a directory is the same as the work directory.

Who to contact with questions and problems

If you have questions or problems with Phred, Phrap or Consed, or need to obtain these programs, please visit the web site:
http://www.phrap.org

If you have questions or problems with PolyPhred, or to report bugs, please

1. read this documentation carefully;
2. go to this web site: http://chum.gs.washington.edu
   Follow the "PolyPhred" link for the email address of the person to contact. Please do not email questions to the web master.

If you discover an error in PolyPhred, please follow step 2 above.

References

1. Kwok, P.Y., Carlson, C., Yager, T.D., Ankener, W., and Nickerson, D.A., 1994,
   "Comparative analysis of human DNA variations by fluorescence-based sequencing 
   of PCR products", Genomics 25, 615-622.

2. Nickerson, D.A., Tobe, V.O., and Taylor, S.L., 1997, "Polyphred: automating the 
   detection and genotyping of single nucleotide substitutions using fluorescence-based 
   resequencing", Nucleic Acids Research, 25: 2745-2751.

3. Ewing, B., Hillier, L., Wendl, M., and Green, P., 1998, "Basecalling of automated 
   sequencer traces using phred.  I. Accuracy assessment", Genome Research 8: 175-185.

4. Ewing, B. and Green, P., 1998, "Basecalling of automated sequencer traces using 
   phred.  II. Error probabilities", Genome Research 8: 186-194.  

5. Green, P., 1994, Phrap, unpublished.  http://www.phrap.org

6. Gordon, D., Abajian, C., and Green, P., 1998, "Consed: A graphical tool for sequence 
   finishing", Genome Research 8: 195-202.

7. Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., and Nickerson, D.A., 2006, "Automating 
   sequence-based detection and genotyping of SNPs from diploid samples", 
   Nature Genetics 38: 375-381.

Score	Rank	Tag Color	True Positive Rate
99	1	red	97%
95-98	2	orange	75%
90-94	3	green	62%
70-89	4	dark blue	35%
50-69	5	magenta	11%
0-49	6	purple	1%
