BAMParser Changes

Developer
Feb 9, 2014 at 5:41 PM
Hi All,

So I have been working with the BAM parser over the weekend quite a bit, with the goal of integrating/extending some of Amber's work on the ploidulator. I wanted to put forward a list of changes that I am making in addition to some basic code clean-up/refactoring.

The central issue is that our specification was not in agreement with the files produced and consumed by SAMTools and Picard, which will often be the upstream toolset applied to BAM files. I am proposing that we change our file formats to match these tools, as they are the most commonly used. Advantages: Compatibility. Disadvantages: Breaking change; all BAM files previously indexed with .NET Bio would need to be re-indexed after these changes.

To describe the changes I am proposing a bit more:
  • The BAM index files typically contain meta-data that is not standardized or documented. I changed our parser/formatter to match this pull request for the spec from the Biopython crowd: https://github.com/samtools/hts-specs/pull/2 which seems to be what most tools do right now. This brings our file formats in line with picard/samtools. The meta-data isn't standardized, but I think this is as good as we can get.
  • The linear index for the BAM file is not adequately described in the specification. We were implementing an indexing method that was distinct from what is used by picard/samtools, and interconverting between formats was impossible. I am changing our code to match the binary output of picard/samtools and use their indexing scheme. This really should have been in the specification, but as it isn't, and samtools is so commonly used, I think we should try to have all our tools be compatible with its output, even if there isn't a working specification for the index file format yet. (SAMTools and Picard also produce slightly different files, but only the ordering of items changes).
  • We were also calculating bins differently. Although the exact calculation is given in the spec for an alignment start/end, it is not clear how to calculate start/end in some situations (end or end+1, what to do when there is no alignment or only a partial alignment, etc.). I have made some slight changes to our calculation that appear to match the samtools bin calculations. This really only affects reads that span a break at 2^14, as we would occasionally move them "up" or "down" the binning tree relative to samtools when things were off by +/- 1. This is a relatively minor issue though.
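To make the bin issue concrete, here is a minimal sketch of the bin calculation given in the SAM/BAM spec (the reg2bin routine), ported to Python for illustration. It assumes a 0-based start and a 0-based exclusive end; it is not the .NET Bio code, and the example coordinates are made up to show how an off-by-one in the end coordinate moves a read near a 2^14 boundary one level up the binning tree.

```python
def reg2bin(beg, end):
    """Bin for the 0-based, half-open interval [beg, end), per the SAM/BAM spec.

    Each pair is (shift, offset of the first bin on that level):
    16 kbp bins start at 4681, 128 kbp at 585, 1 Mbp at 73,
    8 Mbp at 9, 64 Mbp at 1, and bin 0 covers the whole 512 Mbp range.
    """
    end -= 1  # convert the exclusive end to the last covered base
    for shift, offset in ((14, 4681), (17, 585), (20, 73), (23, 9), (26, 1)):
        if beg >> shift == end >> shift:
            return offset + (beg >> shift)
    return 0


if __name__ == "__main__":
    # Hypothetical read ending exactly at the first 2^14 (16384) boundary.
    print(reg2bin(16380, 16384))  # fits in the first 16 kbp bin
    # Same read with the end coordinate off by one: it now crosses the
    # boundary and lands in the parent 128 kbp bin instead.
    print(reg2bin(16380, 16385))
```

Whether the end passed in is the last aligned base or one past it is exactly the ambiguity described above, so the two calls differ only in that convention.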
So I have coded this and am getting ready to commit it, but it wound up being an awful lot, and since I think this qualifies as a breaking change I wanted to run it by people for votes. Also, some unit tests change due to the format changes, so I will likely update those (the goal would be to change the tests so that they check our output matches picard).

Any strong feelings, votes on this, etc?

-Nigel
Coordinator
Feb 10, 2014 at 7:00 PM
No objections here: while this does qualify as a breaking change, it makes the library behave in a more standard manner and I think that would be the expectation of any user.

We will document this and other changes in the release notes for any future release, of course.

Simon
Developer
Feb 11, 2014 at 3:36 AM
Simon, thanks, just committed the changes. I believe this should only be a breaking change in a few sets of edge cases.

The change comes with some new unit tests to ensure functionality/compatibility, as well as new data files for those tests. I always worry a bit that the new data files associated with the unit tests won't upload, so if someone can download the latest and validate that the tests pass on a local machine, that would be great.

Cheers to seamless compatibility,
N