Adding Parsing Options - comments requested

Jun 7, 2013 at 5:20 PM
Hi All,

I wanted to propose a quick change to the repository and get some feedback on it.

Proposal: Allow PadenaUtil to use fasta or fastq files that have been gzipped beforehand.

Reason: For whatever reason, most programs do not accept ".gz" files for assembly, but zipping them ahead of time can save a tremendous amout of space as the compression ratio on FastA and FastQ is usually very good (and would allow me to upload sample data to skydrive, which has a 2 GB limit).

Implementation: In theory this is pretty trivial, as we just take the existing FastA/FastQ files and create the input streamreader to use a decompression stream. However, there a couple ways to implement this, and it's only complicated because the FastAParser class is marked sealed. Some thoughts:
  1. Remove the sealed keyword on the parsers, inherit from them and override the stream creation.
  2. Don't reuse the old class, copy the overlapping code in to a new class.
  3. Keep the class sealed, but give it an additional parameter/constructor to take a gzipped file.
1 and 3 both seem sensible to me, I could go ahead and implement 1, but am not sure why the class was sealed so could do 3 if that seems like a better idea.

Jun 10, 2013 at 5:04 PM
Sounds like a reasonable feature to me, so I don't have any objection - but I'll leave it to others more conversant with programming to comment on the best way to implement the change.

Regarding timing - this is not likely to make it into the 1.1 release because we would like to get that done ASAP.

Jun 10, 2013 at 6:38 PM
I think we could unseal the classes in 1.1, that's a trivial change and not a breaking one. Then Nigel could implement the new parsers separately and we could incorporate them into the Bio.IO dll in the next release. Any objections from anyone to that?


Mark Smith

[email removed] | @marksm | 214-774-4749 |

Jun 11, 2013 at 4:58 PM
I think it makes sense but the question as Simon points out is timing.
There is a training happening in Brazil in early July. We clearly want/need a stable toolset to do the training.
Given that it isn't a breaking change that helps but getting this done, in, and tested or at least everyone feels confident about it needs to happen soon.
I see it makes sense to get ready for a follow-on release. Mark I am fine with your call on this one.

Jun 11, 2013 at 6:00 PM
Edited Jun 11, 2013 at 6:01 PM
I just checked these in, pretty trivial changes. I unsealed the class and made the parse method virtual, then updated the helper methods and parser-finder classes to point to the right file if it looked like a fasta or fastq file with a ".gz" extension.

I think there may have been a few updates to the code analysis between the first check-in of the original parser classes and now however (or new requirments when class isn't sealed). I got a lot of messages about the original FastAParsers and FastQParsers not implementing the IDisposable pattern correctly. Neither Dispose method actually cleans up a non-managed object however, as the StreamReaders are contained with using statements and never held as class level variables. The presence of the dispose method seems to be an artifact of these classes implementing the ISequenceParser interface, so I just turned off those warnings.