Modified BAMParser suggestion

Developer
Oct 1, 2013 at 11:23 AM
Hi everyone,

So, the project I'm currently working on requires me to parse sequences from a BAM file, perform various calculations on these sequences, and finally write the output to a .csv file.

The only output possible from the existing BAMParser is a SequenceAlignmentMap that stores the entire BAM. The SequenceAlignmentMap is loaded into memory, and operations can't be performed on the sequences until the entire file has loaded. For a large file it will take hours to load the data and consume a lot of memory. And if your goal is only to process each sequence or batch of sequences as they are read, and maybe write output directly to another file, then you don't need to store the whole lot in memory anyway.

So that I can process sequences one by one and save memory on my tiny laptop and even tinier SSD, I've created a modified copy of the BAMParser. It's working really nicely for me, and I think it's a good candidate for inclusion in .NET Bio. I'm eager to see what the community here thinks about these proposed changes...

Essentially, I have added two new constructors: BAMParser(IMetricHandler handler) and BAMParser(IMetricHandler handler, bool storeMemory).

The first parameter takes an implementation of a simple interface, IMetricHandler:
public interface IMetricHandler : IDisposable
{
    // Called by the modified BAMParser each time a sequence is created.
    void Add(SAMAlignedSequence sequence);

    // Adds a batch of sequences in one call.
    void AddAll(Collection<SAMAlignedSequence> sequences);

    void ProcessSequences();

    // Called by the modified BAMParser before Parse() returns.
    void FlushSequences();
}
This could be implemented to do pretty much anything with the sequences. For example, my implementation collects each SAMAlignedSequence as it is parsed from the file. When internal logic detects that it has received all of the sequences assigned to a particular cluster or reference chromosome (assuming they are ordered in the BAM file), it performs calculations on that cluster and writes the results to a .csv file. Then, optionally, if the cluster meets certain criteria based on those calculations, each sequence can be written back out to a new BAM file (a filtered version of the original file, ready for further analysis with other tools). If the storeMemory flag is turned off, the data from those sequences is then cleared from memory, making room for more.
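To make the batching idea concrete, here is a minimal, self-contained sketch of a handler that buffers reads until the reference name changes, writes one summary row, and then clears the buffer. It is not my actual implementation: AlignedRead is a hypothetical stand-in for SAMAlignedSequence (only a reference name and a position are modelled), and the summary columns are made up for illustration. A real version would implement IMetricHandler and take SAMAlignedSequence instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical stand-in for SAMAlignedSequence; only the two fields
// this sketch needs are modelled.
public sealed class AlignedRead
{
    public string RName { get; set; }   // reference (chromosome) name
    public int Pos { get; set; }        // leftmost aligned position
}

// Buffers reads until the reference name changes, then writes one CSV
// row for that reference and drops the buffer. Assumes the input is
// ordered by reference.
public sealed class PerReferenceCsvHandler : IDisposable
{
    private readonly TextWriter _csv;
    private readonly List<AlignedRead> _batch = new List<AlignedRead>();
    private string _currentRef;

    public PerReferenceCsvHandler(TextWriter csv)
    {
        _csv = csv;
        _csv.WriteLine("reference,reads,minPos,maxPos");
    }

    // Called once per parsed read.
    public void Add(AlignedRead read)
    {
        if (_batch.Count > 0 && read.RName != _currentRef)
        {
            ProcessSequences();   // previous reference is complete
        }
        _currentRef = read.RName;
        _batch.Add(read);
    }

    // Summarises the buffered batch, writes it, and frees the memory.
    public void ProcessSequences()
    {
        if (_batch.Count == 0) return;
        int min = int.MaxValue, max = int.MinValue;
        foreach (AlignedRead r in _batch)
        {
            min = Math.Min(min, r.Pos);
            max = Math.Max(max, r.Pos);
        }
        _csv.WriteLine("{0},{1},{2},{3}", _currentRef, _batch.Count, min, max);
        _batch.Clear();
    }

    // Called after the last read so the final batch is not lost.
    public void FlushSequences()
    {
        ProcessSequences();
    }

    public void Dispose()
    {
        _csv.Flush();
    }
}
```

Only one reference's worth of reads is held at a time, which is where the memory saving comes from when storeMemory is off.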

If you want to process the sequences incrementally and still receive a SequenceAlignmentMap containing all of the data when the parser returns, that is the default behaviour; it only changes if you turn the storeMemory flag off.

The first benefit is getting a quick result, especially while developing new code. There is no need to wait until all the BAM data is loaded into memory - it takes only a few seconds to start processing the sequences and see the result.

The second benefit is memory: if you don't need to hold the entire BAM file in memory and only need to work with the sequences one by one, perhaps storing just a few, you free up a significant amount of memory on your computer for other work.

So, this might not be the traditional way that most parsers work, but I am finding it to be very effective. I'm eager for some feedback...

Amber
Coordinator
Oct 2, 2013 at 3:07 PM
Hi Amber,

I'm excited to see some new contributors! This looks great and I like the idea. A couple of minor suggestions to think about:

First, I'd recommend not relying directly on Collection<T>, and instead using either ICollection<T> or IEnumerable<T> for AddAll. I'd prefer IEnumerable<T> if you aren't modifying the input collection, as it enforces a read-only, forward-only traversal. I'd also change the "AddAll" name to "AddRange", which is more commonly found in .NET (see List<T> for example). Lastly, if the model requires the Flush call at the end, you might consider using the Dispose pattern instead (i.e. implement IDisposable). That's the "official" way to indicate to a client that some end work must be done with the type when the client is finished with it.
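As a rough illustration of the IEnumerable<T> point, the sketch below shows one AddRange method accepting an array, a List<T>, and a lazy iterator. A parameter typed as Collection<T> would reject all three, since none of them derives from Collection<T>. The class and method names are hypothetical, and strings stand in for the sequence type.

```csharp
using System;
using System.Collections.Generic;

public static class AddRangeDemo
{
    // Accepts any sequence (arrays, List<T>, iterator methods) and
    // only ever reads it forward, once. Returns how many items it saw.
    public static int AddRange(IEnumerable<string> reads)
    {
        int count = 0;
        foreach (string read in reads)
        {
            count++;
        }
        return count;
    }

    // A lazy producer: items are generated one at a time as the
    // consumer asks for them, so nothing is buffered up front.
    public static IEnumerable<string> ReadLazily()
    {
        yield return "read1";
        yield return "read2";
    }
}
```

All of AddRangeDemo.AddRange(new[] { "a" }), AddRangeDemo.AddRange(new List<string> { "a", "b" }) and AddRangeDemo.AddRange(AddRangeDemo.ReadLazily()) compile against the same signature.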

This is great - keep it coming!

mark
Developer
Oct 4, 2013 at 3:04 PM
Hi Mark, thanks :)

ICollection<T> or IEnumerable<T> - of course, I'll change that right now. Honestly, for an interface that different people might choose to implement in a variety of ways I'm not sure how to decide which one to use, but your recommendation sounds good.

AddRange - definitely, oops. I'm new to C#.

As for FlushSequences(), I think I have named it poorly and commented it worse. It isn't intended to do memory cleanup, but it's used to signal to the IMetricHandler that it won't be receiving any more sequences, so if it's still waiting to process any it should do so. The IMetricHandler interface actually 'implements' IDisposable as well, for memory cleanup directly after FlushSequences() has been called.

Cheers for the feedback.

Amber
Coordinator
Oct 4, 2013 at 3:45 PM
Sounds good. Responding to your ideas --
  1. Go with IEnumerable<T> if you only need to read through the set of items with foreach(..). If you need to add/remove items, then ICollection<T> or IList<T> is the right choice.
  2. Ah, so FlushSequences indicates you are finished - I'd name it something like CompleteAdding(), similar to the BlockingCollection<T> class. That's probably not quite right given the terminology but CompleteXXX is probably a good pattern.
  3. Who is responsible for calling IMetricHandler.Dispose? I.e. how do you expect the call sequence will go? Documenting that and going through it might help with #2.
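For reference, this is roughly how the CompleteAdding pattern behaves on BlockingCollection<T> itself: marking the collection complete is a separate step from disposing it, and adding after completion is rejected. The wrapper class and the Run method are hypothetical scaffolding; the BlockingCollection<T> calls are the standard ones.

```csharp
using System;
using System.Collections.Concurrent;

public static class CompleteAddingDemo
{
    // Returns how many items were consumed; lateAddRejected reports
    // whether Add() after CompleteAdding() threw.
    public static int Run(out bool lateAddRejected)
    {
        int consumed = 0;
        lateAddRejected = false;

        using (var queue = new BlockingCollection<string>())
        {
            queue.Add("read1");
            queue.Add("read2");
            queue.CompleteAdding();   // signal: no more items will arrive

            // The consuming enumerable ends once the queue is drained
            // and adding has been marked complete.
            foreach (string read in queue.GetConsumingEnumerable())
            {
                consumed++;
            }

            try
            {
                queue.Add("read3");
            }
            catch (InvalidOperationException)
            {
                lateAddRejected = true;
            }
        }   // Dispose() runs here, after completion, as a separate step

        return consumed;
    }
}
```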
mark
Developer
Oct 4, 2013 at 4:35 PM
'Flush' is more of a stream thing, so I should definitely rename it to something else. CompleteProcessing()? SetComplete(), similar to ContextUtil.SetComplete?

This is how I expect the calling of the IMetricHandler methods to go, integrated into existing code for BAMParser...
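            IList<SAMAlignedSequence> alignedSeqs = GetAlignedSequences(chunks, start, end);
            foreach (SAMAlignedSequence alignedSeq in alignedSeqs)
            {
                // Keep the sequence in the map only when storeMemory is on.
                if (storeMemory)
                {
                    seqMap.QuerySequences.Add(alignedSeq);
                }

                // Always hand the sequence to the handler, if one was supplied.
                if (metricHandler != null)
                {
                    metricHandler.Add(alignedSeq);
                }
            }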

            // some time later and in a different method.......

            if (metricHandler != null)
            {
                metricHandler.FlushSequences();
                metricHandler.Dispose();
            }

            return seqMap;   // SequenceAlignmentMap
        }
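To walk through the call order in isolation, here is a small self-contained sketch that records which handler methods are called and when. It assumes the handler should receive every sequence regardless of storeMemory, and that the parser (not the caller) disposes the handler; both are assumptions for discussion rather than settled design. RecordingHandler and ParseSketch are hypothetical names, and strings stand in for SAMAlignedSequence.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical handler that just logs the order of calls it receives.
public sealed class RecordingHandler : IDisposable
{
    public readonly List<string> Calls = new List<string>();

    public void Add(string read) { Calls.Add("Add:" + read); }
    public void FlushSequences() { Calls.Add("Flush"); }
    public void Dispose() { Calls.Add("Dispose"); }
}

public static class ParseSketch
{
    // Stand-in for the parse loop: returns the stored reads (empty
    // when storeMemory is false).
    public static List<string> Parse(
        IEnumerable<string> reads, RecordingHandler handler, bool storeMemory)
    {
        var stored = new List<string>();
        try
        {
            foreach (string read in reads)
            {
                if (storeMemory) stored.Add(read);
                if (handler != null) handler.Add(read);
            }

            // Only signal completion if parsing finished normally.
            if (handler != null) handler.FlushSequences();
        }
        finally
        {
            // Cleanup runs even if parsing threw part-way through.
            if (handler != null) handler.Dispose();
        }
        return stored;
    }
}
```

With two reads and storeMemory off, the recorded order is Add, Add, Flush, Dispose, and the returned list is empty. If the caller were to own the handler instead, the finally block would go away and the caller would wrap the handler in a using statement.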
Amber