Closed

Dummy reads in BAM file cause parser to crash

description

I am using the .NET Bio NuGet package 2.0 and have noticed an issue with the BAM parser.

The parser crashes when it encounters a dummy read in a BAM file. The SAM file below is a minimal example:
@HD VN:1.3  SO:coordinate
@SQ SN:fakeref  LN:1000
*   768 fakeref 10  255 1M  *   0   0   *   *   CT:Z:.;ESDN;
Simply converting it to BAM and trying to parse it results in the following error:
Unhandled Exception: System.Exception: Run failure. ---> System.IndexOutOfRangeException: Index was outside the bounds of the array.
   at Bio.IO.BAM.BAMParser.GetAlignedSequence(Int32 start, Int32 end)
Thank you,
Gabriel

Closed Jun 29, 2015 at 12:50 AM by evolvedmicrobe
Solved with latest commit

comments

evolvedmicrobe wrote Jun 24, 2015 at 8:07 PM

Dummy reads appear to be a new feature in the BAM/SAM format. Do you know what the point of this is? This read may be spec compliant, but I would argue it should be considered an error because the CIGAR indicates a read of length 1 while no sequence data is present. I like the idea of enforcing a correspondence between read length and CIGAR length.

Does anyone know what purpose these dummy reads serve? Would be good to support them fully if we want to stay spec compliant.
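
To make the read length versus CIGAR length question concrete, here is a minimal Python sketch of the check. It is not the .NET Bio code; the function names and the `strict` flag are hypothetical. It assumes the SAM convention that only the M, I, S, = and X operations consume bases of the read, and that a SEQ of "*" means the sequence is not stored.

```python
import re

# CIGAR operations that consume bases of the read (query).
QUERY_CONSUMING_OPS = set("MIS=X")
CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_query_length(cigar):
    """Return the read length implied by a CIGAR string ("*" implies 0)."""
    if cigar == "*":
        return 0
    tokens = CIGAR_TOKEN.findall(cigar)
    # Rebuild the string from the parsed tokens to reject malformed input.
    if "".join(count + op for count, op in tokens) != cigar:
        raise ValueError("malformed CIGAR string: %r" % cigar)
    return sum(int(count) for count, op in tokens if op in QUERY_CONSUMING_OPS)


def seq_matches_cigar(seq, cigar, strict=False):
    """Check that the sequence length agrees with the CIGAR string.

    strict=False: a missing sequence ("*") or missing CIGAR ("*") is exempt,
                  so a dummy read such as SEQ="*", CIGAR="1M" passes.
    strict=True:  a missing sequence counts as length 0 and must match.
    """
    if not strict and (seq == "*" or cigar == "*"):
        return True
    seq_length = 0 if seq == "*" else len(seq)
    return seq_length == cigar_query_length(cigar)
```

With the record from this report (SEQ "*", CIGAR "1M"), the lenient check passes and the strict one fails, which is the trade-off being discussed here.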

gabrielm wrote Jun 25, 2015 at 3:49 PM

Forcing the read length to match the CIGAR string was also my initial thought.

However, dummy reads are a special feature used for annotation purposes. For instance, they allow exon information from gene models in GFF3 format to be included in the SAM file; they allow suspected errors in the reference to be marked, and so on.

In general, the "*" ref and a CT tag will indicate that it is a dummy read.
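
As a rough illustration of that rule, the following Python sketch flags a SAM record as a dummy read. It is only a heuristic under stated assumptions: the "*" is taken to mean the SEQ field (the tenth column), which is how the example record in this report looks, and `is_dummy_read` is a hypothetical helper, not part of .NET Bio.

```python
def is_dummy_read(sam_line):
    """Heuristic: SEQ is "*" and the record carries a CT:Z annotation tag."""
    line = sam_line.rstrip("\r\n")
    # SAM is tab-delimited; fall back to any whitespace for pasted examples
    # where the tabs were turned into spaces.
    fields = line.split("\t") if "\t" in line else line.split()
    if len(fields) < 11:
        raise ValueError("a SAM record needs 11 mandatory fields")
    seq = fields[9]              # 10th mandatory field: SEQ
    optional_tags = fields[11:]  # TAG:TYPE:VALUE entries
    has_ct_tag = any(tag.startswith("CT:Z:") for tag in optional_tags)
    return seq == "*" and has_ct_tag


example = "*\t768\tfakeref\t10\t255\t1M\t*\t0\t0\t*\t*\tCT:Z:.;ESDN;"
print(is_dummy_read(example))
```

The example line is the record from the original report with its tabs restored.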

evolvedmicrobe wrote Jun 27, 2015 at 11:07 PM

Ok, this should be easy enough to implement, I'll take another look after I clean up our testing framework a bit so I can make sure everything is kosher.

evolvedmicrobe wrote Jun 29, 2015 at 12:45 AM

Just added the ability to parse dummy reads in latest commit (https://bio.codeplex.com/SourceControl/changeset/5182b998b8839c69b322bb567bc097b159d26a61)

It appears that BAM writers (e.g. htslib) do not typically encode the "*" for the sequence, and so leave it to the parser to add back in. For dummy reads, we just leave the sequence as a null value, and the presence of a dummy read can be detected with the new IsDummyRead property on SAMAlignedSequence.
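
For readers who want to see what this looks like at the byte level, here is a small Python sketch of decoding a BAM sequence field with a guard for the empty case. It is not the code from the commit above. It assumes the BAM layout in which bases are packed two per byte (high nibble first) using the table "=ACMGRSVTWYHKDBN", and in which a missing sequence is stored as a length of 0 with no bytes. `decode_bam_seq` is a hypothetical name, and returning `None` mirrors the null sequence described above.

```python
# 4-bit base codes used in BAM records, indexed by nibble value.
BAM_BASES = "=ACMGRSVTWYHKDBN"


def decode_bam_seq(packed, l_seq):
    """Decode l_seq bases from the packed bytes; None if no sequence is stored."""
    if l_seq == 0:
        # Nothing was encoded (SEQ was "*" in SAM), so do not index the array.
        return None
    if len(packed) < (l_seq + 1) // 2:
        raise ValueError("packed sequence is shorter than l_seq requires")
    bases = []
    for i in range(l_seq):
        byte = packed[i // 2]
        # Even positions sit in the high nibble, odd positions in the low one.
        code = byte >> 4 if i % 2 == 0 else byte & 0x0F
        bases.append(BAM_BASES[code])
    return "".join(bases)
```

Checking the stored length before touching the byte array is what avoids the kind of out-of-range access shown in the original stack trace, whatever range the CIGAR string suggests.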

Thanks for reporting this issue and providing test data!

gabrielm wrote Jun 29, 2015 at 5:16 PM

Thanks for that!

I'll give it a try.

gabrielm wrote Jul 8, 2015 at 1:25 AM

Sorry - is this included in the NuGet package?

I do not see any updates there.

Thanks!