Sequencing Machine Format Parsers?

Coordinator
Aug 18, 2013 at 1:41 AM
Hi folks
I have some students who have signed up for a project to build support for NGS-related processing on top of .NET Bio. I have pinged Kirt Haden from http://www.bionanogenomics.com/ for some suggestions, and I will list these in a separate post. But I am also interested in people's thoughts on the general utility of NGS machine format parsers. I note the existence of the Bio.IO.AppliedBiosystems namespace, which plainly provides a good model, and implementing new parsers like this would be a very good early-stage component of a larger project.

So, what do people think about adding additional parsers along these lines? I am wondering particularly about extensions based on FASTQ as well, such as support for the Illumina sequence identifier strings (see http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers for example).
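
To make the Illumina identifier idea concrete, here is a minimal sketch in Python (not .NET Bio code) of splitting a Casava 1.8-style identifier line into its fields, following the layout described on the Wikipedia page linked above. The names `IlluminaId` and `parse_illumina_id` are hypothetical, and older Illumina identifier layouts are not handled.

```python
import re
from typing import NamedTuple, Optional


class IlluminaId(NamedTuple):
    """Fields of a Casava 1.8-style identifier (hypothetical container)."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control_number: int
    index_sequence: str


# Layout assumed here:
# @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
_PATTERN = re.compile(
    r"^@?(?P<instrument>[^:\s]+):(?P<run>\d+):(?P<flowcell>[^:\s]+)"
    r":(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"\s+(?P<read>\d+):(?P<filtered>[YN]):(?P<control>\d+):(?P<index>\S*)$"
)


def parse_illumina_id(line: str) -> Optional[IlluminaId]:
    """Return the parsed fields, or None if the line does not match."""
    m = _PATTERN.match(line.strip())
    if m is None:
        return None
    return IlluminaId(
        instrument=m.group("instrument"),
        run_number=int(m.group("run")),
        flowcell_id=m.group("flowcell"),
        lane=int(m.group("lane")),
        tile=int(m.group("tile")),
        x_pos=int(m.group("x")),
        y_pos=int(m.group("y")),
        read=int(m.group("read")),
        is_filtered=m.group("filtered") == "Y",
        control_number=int(m.group("control")),
        index_sequence=m.group("index"),
    )
```

Returning None for non-matching lines lets a caller fall back to treating the identifier as an opaque string, which is what a generic FASTQ parser does anyway.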

Thoughts?

cheers
jh
Developer
Aug 18, 2013 at 1:48 AM
Edited Aug 18, 2013 at 5:26 AM
The Broad currently dumps all NGS sequencing data to a BAM right away, so at least on our end this is about the only thing we use (though occasionally they are converted back to FASTQ at a later stage). BAMs have meta-information in them that might be useful to be able to read in .NET Bio.

Illumina strings could be useful. Given how much data these machines produce, I think being able to read/write the gzipped format directly could be a big win for us (and with the compressed reader classes it is so easy). I don't think storing this data as uncompressed files is really feasible these days.
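
As an illustration of reading the gzipped format directly, here is a minimal Python sketch (not the .NET Bio compressed reader classes) that streams records from a gzipped FASTQ file without unpacking it to disk first. The function name `read_fastq_gz` is hypothetical, and the sketch assumes the common four-lines-per-record layout with no wrapped sequence lines.

```python
import gzip
from typing import Iterator, Tuple


def read_fastq_gz(path: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) from a gzipped FASTQ file.

    Assumes each record is exactly four lines: '@' header, sequence,
    '+' separator, quality string.
    """
    # "rt" makes gzip decompress on the fly and hand back text lines.
    with gzip.open(path, "rt", encoding="ascii") as handle:
        while True:
            header = handle.readline()
            if not header:
                return  # end of file
            sequence = handle.readline().rstrip()
            handle.readline()  # skip the '+' separator line
            quality = handle.readline().rstrip()
            if not header.startswith("@"):
                raise ValueError("expected '@' at start of record")
            if len(sequence) != len(quality):
                raise ValueError("sequence and quality lengths differ")
            yield header[1:].rstrip(), sequence, quality
```

Because the function is a generator, only one record is held in memory at a time, which matters for files of the size these machines produce.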

Anyway, just some thoughts.
Coordinator
Aug 18, 2013 at 2:01 AM
Thanks Nigel. There is of course Bio.IO.SAM as a namespace, and there is a class called SAMAlignmentHeader which has a bunch of Properties which make sense, but do not appear especially specific or detailed. This seems to be the result type for the header in the BAM parser (Bio.IO.BAM.BAMParser). If there are conventions in the use of these metadata, or if this doesn't capture everything, then this might be useful. Thoughts? Copy and paste from the docs (sans format) below:
Public properties (name: description)
Comments: List of comment headers.
RecordFields: List of record fields. It holds all available record fields except comments.
ReferenceSequences: Holds the list of reference sequence names and lengths. SAMParser updates this property from the SQ header if present; otherwise it is updated from each alignment's information, in which case the length of the reference sequence is unknown and is set to zero. BAMParser updates this property from the reference information block and not from the SQ header. BAMFormatter uses this information to write the reference information block. SAMFormatter does not require this information and ignores it.
Developer
Aug 18, 2013 at 5:33 AM
Hi Jim,

Thanks, I actually didn't notice the comments fields, despite spending a lot of time in that class, doh. There do actually seem to be some conventions for that class which might be useful to add (read groups, libraries, etc.). These are used by the GATK/Picard ensemble of the Broad pipelines. However, when I brought our BAM parser up to date with the recent format, I definitely got the impression that nailing down the SAM specification was like herding cats.
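
For reference, read groups and libraries live in the text header of a SAM/BAM file as tab-separated TAG:VALUE fields on `@RG` lines (for example `ID`, `SM`, `LB`). Below is a minimal Python sketch of collecting those header records; it is not the SAMAlignmentHeader implementation, and `parse_sam_header` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_sam_header(lines: Iterable[str]) -> Dict[str, List[Dict[str, str]]]:
    """Group SAM header lines by record type (HD, SQ, RG, PG, CO).

    Each record becomes a dict of TAG -> VALUE. Comment lines (@CO) are
    free text, so they are stored under the key "text" instead.
    """
    records: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment
        record_type, _, rest = line.partition("\t")
        code = record_type[1:]
        if code == "CO":
            records[code].append({"text": rest})
            continue
        fields: Dict[str, str] = {}
        for field in rest.split("\t"):
            # Split on the first colon only, since values may contain colons.
            tag, sep, value = field.partition(":")
            if sep:
                fields[tag] = value
        records[code].append(fields)
    return dict(records)
```

With this shape, the libraries used in a file can be listed with something like `{rg.get("LB") for rg in header.get("RG", [])}`.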

Peter Cock, who wrote the Biopython BAM parser and apparently hit the same problem I did with the specification being out of date, put in a pull request to get it updated. It has been sitting there for months with no one incorporating it, even though it has some useful updates and is quite good in my opinion. He mentioned that someone was just hired to work specifically on samtools for the English consortium. All of which is to say it might be worth waiting for that person to take over consolidating the specification before doing any more implementations, since, as you pointed out, we have a flexible and working version now.

Cheers,
N
Coordinator
Aug 18, 2013 at 11:43 AM
Edited Aug 18, 2013 at 11:47 AM
Thanks for that. Really useful comment. Will leave well alone for now, but maybe bring it back to the agenda as things develop.

Noticed in the 1.1 release notes that "New parsers are available which support .zip based files for FASTA and FASTQ parsing" which means that the Illumina version can't be that much of a stretch. Will suggest that as a starting point.
Coordinator
Aug 23, 2013 at 5:27 PM
Another thought would be to build some support for reading and writing the various formats we support while they are compressed (zipped).

I have no objection to direct sequencer support, but it seems to me that relying on the instrumentation vendor to support commonly-used formats (which they do) relieves us of the need to create and maintain instrumentation-specific code in such a rapidly-evolving field.

There are worthwhile exceptions though - de facto standards such as Illumina might be worth the effort of supporting and maintaining, as might emerging technologies in cases where the data generated cannot be stored in a standard format like BAM/SAM or FASTA. A further argument for doing so would be if direct access to machine-specific formats provides a significant performance boost.

One thing I would advocate against though is co-opting comments fields to store structured information. There is a horrible track record of this in bioinformatics - for example on FASTA identifier lines.

Simon
Coordinator
Aug 23, 2013 at 5:46 PM
We already support zipped formats. That was part of 1.01. :-)

Mark Smith

Coordinator
Aug 23, 2013 at 6:19 PM
Oh sorry, 1.1!

Coordinator
Aug 23, 2013 at 6:39 PM
I really must pay more attention :-)