Bam trouble

Nov 16, 2014 at 10:47 AM

I have been used BAMParserExtensions.ParseRange to access to different region of genome, but suddenly I've found that there are rare cases than the gene in one experiment has alot of reads, but in another very similar experiment there are no any read at all.
        static void Main()
            var bamFiles = new[]{@"D:/MichelData/SE/1/Aligned.sortedByCoord.out.bam",@"D:/MichelData/SE/2/Aligned.sortedByCoord.out.bam"};
            string geneName = "ENSMUSG00000091474";
            int start = 41599261;
            int end = 41628533;
            string chr ="7";
            using (BAMParser parser = new BAMParser())
                foreach (var bamFile in bamFiles)
                    SequenceAlignmentMap result = BAMParserExtensions.ParseRange(parser, bamFile, chr, start, end);
You can find those files here. It's huge files, but I don't know how else to reproduce the result.

I checked existence of reads in this region by samtools
samtools view -h Aligned.sortedByCoord.out.bam 7:41599261-41628533 > query
And in both cases I can find the non-zero set of reads (you can find the file "query" in the same place), so it's not for biological reason - something wrong with Bio.IO.BAM.Parser and I can't figure out for exact reason.

Nov 17, 2014 at 7:10 AM
Edited Nov 18, 2014 at 4:48 AM
Hi Pavel,

Well samtools working where our parser doesn't isn't great. Can I ask which version of the library you are using? (latest download or fresh git clone), this would help track the issue.

Nov 17, 2014 at 7:46 AM
Hi Nigel,

it was latest download from nuget - the runtime verison is v4.0.30319.

Nov 17, 2014 at 10:33 AM
Edited Nov 17, 2014 at 10:34 AM
Hi Pavel,

Hmm.. I can't reproduce this with samtools. Trying to read the interval in the downloaded file returns no reads:
samtools view Aligned.sortedByCoord.out.bam 7:41599261-41628533 | wc -l
And the file I downloaded appears to itself be corrupted:
 samtools view -h Aligned.sortedByCoord.out.bam > test
 [main_samview] truncated file.
One possibility is we are using different versions of samtools or that the downloaded file was corrupted. However, if I take your query file and make a bam out of it:
samtools view -h -b query.sam > test.bam 
samtools index test.bam 
Then this file works fine in both samtools and .NET, and all the reads are returned. Are you using the latest version of samtools? Would you mind converting your query file and trying it in .NET Bio?

Nov 17, 2014 at 11:57 AM
Edited Nov 18, 2014 at 9:14 AM
Ok, I've done it again and convert bam to sam as well - without any error.
[BioTrouble]$ samtools view -h 1/Aligned.sortedByCoord.out.bam 7:41599261-41628533 | wc -l
[BioTrouble]$ samtools view -h 2/Aligned.sortedByCoord.out.bam 7:41599261-41628533 | wc -l
[BioTrouble]$ samtools view Aligned.sortedByCoord.out.bam > test
[BioTrouble]$ samtools --version
samtools 1.1
Using htslib 1.1
Copyright (C) 2014 Genome Research Ltd.
As you I take query file and make bam/bai, and it works fine in both cases as well. I deposit query.bam/.bai in the same place.

I see three possibilities (can occur simultaneously):
  1. we use different versions of samtools
  2. your downloading process was interrupted
  3. under certain conditions something wrong with .NET Bio
Nov 18, 2014 at 4:41 AM
Hi Pavel,

Almost there, after trying the download on a better server, I was able to get a file that produced your result with samtools v1.1 and v1.0. With one exception, I think you also included the -h parameter to samtools, in order to get the 1143 count, otherwise it is 1074.

In any event, there was a bug in the indexing implementation, which is now fixed with this patch:

I'll ask Mark if he'd be willing to put out a new nuget package for you to work with.

Nov 18, 2014 at 2:36 PM
I'll see if I can get one pushed today with the latest fixes. Thanks Nigel & Pavel!

Nov 18, 2014 at 8:02 PM
I've pushed a new build up to Nuget -- Pavel, you should be able to just update your packages in your project to get the fix.