So, the project I'm currently working on requires me to parse sequences from a BAM file, perform various calculations on these sequences, and finally write the output to a .csv file.
The only output possible from the existing BAMParser is a SequenceAlignmentMap that stores the entire BAM file. The SequenceAlignmentMap is loaded into memory, and operations can't be performed on the sequences until the entire file has loaded. For a large file, it will take hours to load the data and consume a lot of memory. If your goal is only to process each sequence, or batch of sequences, as it is read, and maybe write output directly to another file, then you don't need to store the whole lot in memory.
So that I can process sequences one by one and save memory on my tiny laptop and even tinier SSD, I've created a modified copy of the BAMParser. It's working really nicely for me, and I think it's a good candidate for inclusion in .NET Bio. I'm eager to see what the community here thinks about these proposed changes...
Essentially, I have added two new constructors: BAMParser(IMetricHandler handler) and BAMParser(IMetricHandler handler, bool storeMemory).
The first parameter takes an implementation of a simple interface, IMetricHandler:
public interface IMetricHandler : IDisposable
{
    // Called by modified BAMParser each time a sequence is created.
    void Add(SAMAlignedSequence sequence);
    void AddAll(Collection<SAMAlignedSequence> sequences);
    // Called by modified BAMParser before Parse() returns.
}
This could be implemented to do pretty much anything with the sequences. For example, my implementation collects each SAMAlignedSequence as it is parsed from the file. When internal logic detects that it has received all of the sequences assigned to a particular cluster or reference chromosome (assuming they are ordered in the BAM file), it performs calculations on that cluster and writes the results to a .csv file. Then, optionally, if the cluster meets certain criteria based on those calculations, each sequence can be written back out to a new BAM file (a filtered version of the original file, ready for further analysis with other tools). If the storeMemory flag is turned off, the data from those sequences is then cleared from memory, making room for more.
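A stripped-down sketch of that kind of handler might look like the following. It is illustrative only: SAMAlignedSequence here is a tiny stand-in class so the example compiles on its own (the property names RName and MapQ are assumed to mirror the .NET Bio type), PerReferenceCsvHandler is a hypothetical name, and the "calculation" is just a read count and mean mapping quality per reference.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;
using System.Globalization;
using System.IO;
using System.Linq;

// Stand-in for the .NET Bio SAMAlignedSequence type (assumption: only the
// two properties used below are modelled here).
public sealed class SAMAlignedSequence
{
    public string RName { get; set; }   // reference (chromosome) name
    public int MapQ { get; set; }       // mapping quality
}

public interface IMetricHandler : IDisposable
{
    void Add(SAMAlignedSequence sequence);
    void AddAll(Collection<SAMAlignedSequence> sequences);
}

// Hypothetical handler: buffers the reads of one reference at a time and
// writes one CSV row per reference (name, read count, mean MapQ).
public sealed class PerReferenceCsvHandler : IMetricHandler
{
    private readonly TextWriter output;
    private readonly List<SAMAlignedSequence> batch = new List<SAMAlignedSequence>();
    private string currentReference;

    public PerReferenceCsvHandler(TextWriter output)
    {
        this.output = output;
        output.WriteLine("reference,reads,mean_mapq");
    }

    public void Add(SAMAlignedSequence sequence)
    {
        // A change of reference name means the previous batch is complete.
        // This relies on the BAM file being ordered by reference.
        if (batch.Count > 0 && sequence.RName != currentReference)
        {
            Flush();
        }
        currentReference = sequence.RName;
        batch.Add(sequence);
    }

    public void AddAll(Collection<SAMAlignedSequence> sequences)
    {
        foreach (SAMAlignedSequence sequence in sequences)
        {
            Add(sequence);
        }
    }

    private void Flush()
    {
        if (batch.Count == 0) return;
        double meanMapQ = batch.Average(s => s.MapQ);
        output.WriteLine(string.Format(CultureInfo.InvariantCulture,
            "{0},{1},{2:F2}", currentReference, batch.Count, meanMapQ));
        batch.Clear();   // drop the buffered reads so memory use stays bounded
    }

    public void Dispose()
    {
        Flush();         // write the last, still-buffered reference
        output.Flush();  // the caller owns the writer and closes it
    }
}
```

Because the handler only ever holds the reads of one reference, its memory use depends on the largest cluster rather than on the size of the BAM file.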
If you want to process the sequences incrementally but also receive a SequenceAlignmentMap containing all of the data when the parser returns, that is the default behaviour; it only changes if you turn the storeMemory flag off.
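To make the effect of the flag concrete, here is a rough sketch of the dispatch logic. It is not the modified BAMParser itself: StreamingParserSketch and CountingHandler are hypothetical names, SAMAlignedSequence is a stand-in class, and the records come from an in-memory sequence instead of a BAM stream. The only behaviour it is meant to show is that every record goes to the handler, and a copy is kept for the return value only when storeMemory is true (the default).

```csharp
using System;
using System.Collections.Generic;
using System.Collections.ObjectModel;

// Stand-in for the .NET Bio type so the sketch compiles on its own.
public sealed class SAMAlignedSequence
{
    public string QName { get; set; }
}

public interface IMetricHandler : IDisposable
{
    void Add(SAMAlignedSequence sequence);
    void AddAll(Collection<SAMAlignedSequence> sequences);
}

// Trivial handler used only to observe what the parser sketch sends out.
public sealed class CountingHandler : IMetricHandler
{
    public int Count { get; private set; }
    public void Add(SAMAlignedSequence sequence) { Count++; }
    public void AddAll(Collection<SAMAlignedSequence> sequences) { Count += sequences.Count; }
    public void Dispose() { }
}

// Hypothetical stand-in for the modified parser's constructor and loop logic.
public sealed class StreamingParserSketch
{
    private readonly IMetricHandler handler;
    private readonly bool storeMemory;

    // One-argument form keeps everything, matching the default described above.
    public StreamingParserSketch(IMetricHandler handler) : this(handler, true) { }

    public StreamingParserSketch(IMetricHandler handler, bool storeMemory)
    {
        this.handler = handler;
        this.storeMemory = storeMemory;
    }

    public List<SAMAlignedSequence> Parse(IEnumerable<SAMAlignedSequence> records)
    {
        var stored = new List<SAMAlignedSequence>();
        foreach (SAMAlignedSequence record in records)
        {
            handler.Add(record);       // the handler always sees the record
            if (storeMemory)
            {
                stored.Add(record);    // retained only when the flag is on
            }
        }
        return stored;
    }
}
```

With storeMemory set to false, the returned collection stays empty and each record becomes eligible for garbage collection as soon as the handler lets go of it.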
The first benefit is getting a quick result, especially while developing new code. There is no need to wait until all the BAM data is loaded into memory - it takes only a few seconds to start processing the sequences and see the result.
The second benefit is that if you don't need to hold the entire BAM file in memory, and only need to work with the sequences one by one (perhaps storing just a few), you free up a significant amount of memory on your computer for other things.
So, this might not be the traditional way that most parsers work, but I am finding it to be very effective. I'm eager for some feedback...