Dec 2, 2013 at 12:44 AM
On 1 October I posted that I would be developing a cluster analysis tool for the double digest RAD sequencing pipeline (ddRADseq). The purpose of the tool was to analyse clusters of reads (each cluster intended to contain reads from the same locus), to
determine the accuracy of each cluster and enable bad reads to be filtered out.
A couple of weeks ago I finished work on a prototype for this tool, which I named Ploidulator. (I say 'finished' because this was a student project for me, and I just graduated. But I'll still have some involvement with Ploidulator - and more work will definitely
be done on it by me or others.)
Ploidulator utilises the .NET Bio BAMParser and BAMFormatter, and incorporates changes made to the BAMParser to enable iterative file parsing.
To copy a few paragraphs from my report:
Using this tool, a BAM file containing aligned reads can be parsed, and user-defined parameters can be applied to determine which clusters are ‘good’. Good clusters are then written to a new filtered BAM file. Various metric statistics are also
produced as text files, for both the original input file and the filtered ‘good’ output file, and this statistical data can be incorporated in future analyses of either the original or filtered files. Metrics include measures of per-cluster read and alignment
quality, read and population coverage, and ploidy-aware sequence count distribution and haplotype estimation.
Ploidulator also displays bar, line and pie charts for a dynamic visual representation of clusters as the input file is being parsed.
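The filtering idea described above (compute per-cluster metrics, then keep only clusters that meet user-defined thresholds) can be sketched roughly as follows. This is an illustrative Python sketch, not Ploidulator's actual implementation: the read tuple layout, the metric names, the threshold parameters and the function names are all assumptions made for the example, and a real tool would take these fields from parsed BAM records. It assumes reads arrive already grouped by cluster, so only one cluster is held in memory at a time.

```python
from itertools import groupby
from statistics import mean

# Hypothetical read record: (cluster_id, individual_id, mapping_quality).
# These fields stand in for values that would come from parsed BAM records.

def cluster_metrics(reads):
    """Compute a few simple per-cluster metrics from a list of read tuples."""
    return {
        "read_count": len(reads),
        "mean_mapq": mean(r[2] for r in reads),
        "individuals": len({r[1] for r in reads}),
    }

def filter_clusters(reads, min_reads=3, min_mean_mapq=20, min_individuals=2):
    """Yield (cluster_id, reads, metrics) for clusters that pass every threshold.

    Assumes the input is ordered so that reads of one cluster are adjacent;
    groupby then lets us process one cluster at a time.
    """
    for cluster_id, group in groupby(reads, key=lambda r: r[0]):
        cluster = list(group)
        m = cluster_metrics(cluster)
        if (m["read_count"] >= min_reads
                and m["mean_mapq"] >= min_mean_mapq
                and m["individuals"] >= min_individuals):
            yield cluster_id, cluster, m

reads = [
    ("locus1", "indA", 40), ("locus1", "indA", 38), ("locus1", "indB", 42),
    ("locus2", "indA", 5), ("locus2", "indB", 7), ("locus2", "indB", 6),
    ("locus3", "indA", 40),
]

# Only the IDs of clusters that pass the (assumed) thresholds are collected here;
# a real tool would write the passing reads to the filtered output file instead.
good = [cluster_id for cluster_id, _, _ in filter_clusters(iter(reads))]
print(good)
```

Because `filter_clusters` is a generator, passing clusters could be written to the output file as soon as they are found, rather than being accumulated in memory.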
There are a few limitations:
The scope of this tool is to provide various basic metrics which may enable clusters to be filtered, while providing a framework into which additional metrics can be added or metric implementation details can be changed in the future. Regarding
the genotyping of each individual, the current genotyping process provides an estimate of the most probable genotype for each individual rather than a definitive analysis. Time constraints also prevented the implementation of a custom haplotyping component,
so a third-party tool, PHASE v 2.1 (PHASE) developed by Stephens and Scheet (2005), is packaged with Ploidulator and used to compute population haplotypes based on the per-individual genotype estimates.
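To give a sense of what a per-individual genotype estimate from sequence counts might look like, here is a deliberately simple Python heuristic. It is not the method used by Ploidulator or by PHASE: the function name, the `ploidy` and `min_fraction` parameters and the "most frequent sequences" rule are assumptions chosen only for illustration. For one individual at one locus, it keeps up to `ploidy` of the most frequent distinct sequences, dropping any that make up too small a share of the reads (which may be sequencing errors).

```python
from collections import Counter

def estimate_genotype(sequences, ploidy=2, min_fraction=0.2):
    """Return a sorted tuple of the sequences taken as alleles for one individual.

    Illustrative heuristic only: keep at most `ploidy` of the most frequent
    distinct sequences, and only those supported by at least `min_fraction`
    of the individual's reads at this locus.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    alleles = [
        seq
        for seq, n in counts.most_common(ploidy)
        if n / total >= min_fraction
    ]
    return tuple(sorted(alleles))

# Two well-supported sequences plus one rare sequence: called as two alleles.
het = estimate_genotype(["ACGT"] * 6 + ["ACGA"] * 5 + ["TCGT"])

# One dominant sequence; the second falls below min_fraction and is dropped.
hom = estimate_genotype(["ACGT"] * 9 + ["ACGA"])

print(het, hom)
```

A heuristic like this only gives a most-probable call per individual; resolving haplotypes across the population is a separate step, which is the part delegated to PHASE.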
There are also a number of future areas for improvement that have been identified, and this list is by no means complete.
The tool performs as designed and offers a necessary degree of user customisability and a range of output metrics. However, at this stage it has been tested on only one sample dataset, so more extensive testing is recommended, and there may be additional
metrics which could also be considered for inclusion. In addition, the implementation of the existing metrics may benefit from being reviewed with opportunities for optimisation in mind, and a custom haplotype detection algorithm should be considered to replace
the current coupling arrangement with PHASE.
Ploidulator is designed to operate on any personal computer running Windows, and does not require unusually high memory or processing power. The process of outputting results directly to file conserves memory usage and means that the filtered file is available
for further analysis immediately after the analysis process has completed. For a ~1.3 GB input file, the application will require ~4 GB of free memory. Due to the threading model, a 64-bit environment is recommended.
Particularly with regard to improvements, BAMFormatter may need some attention. Ploidulator successfully takes a BAM input file and produces a filtered BAM output file containing a reduced number of reads. But while the output file is recognised as valid by
SAMUtils and can be parsed using BAMParser, the output file size is significantly larger than the input file and the file write speed is very slow. I haven't had a chance to investigate the causes for this yet, and I will write it up as an issue.
So, watch this space for future Ploidulator developments. :)