Padena - change to input, comments please!

Coordinator
May 21, 2013 at 2:30 PM
This message is an edited version of a mail from Nigel, but I thought it would be good to get wider opinions by posting on the forums, because this proposes changes to existing behavior:

PadenaUtil has an issue where if any read in the file contains a “N” or any ambiguity, then PadenaUtil.exe fails completely. Since essentially all data files will have reads with some ambiguous bases, this effectively makes the padena assembler useless for many people.

I would propose the following. Right now, the call to ValidateAmbiguousReads in PadenaUtil.exe kills the program if any reads are ambiguous, giving a fatal error. In contrast, during the graph building stage all the reads are validated and any ambiguous reads are dropped. It seems to me we could:

1- Continue to kill the assembler if any read is ambiguous and parse the read file twice at both the graph building an pre-processing stage (not my preference).

2- Simply remove the call to validate the reads in the executable and rely on the graph for validation/filtering.

3- Change the assembler to output a status message after building the graph with the total number of reads parsed as well as the percentage of reads with ambiguous bases, encouraging the user to trim the reads if they would like to use more data (I think this would be best).

Anyway, I can implement 2 or 3 if we think they would both be good. As an aside, I think it might also be nice to change the output so that “Total CPU time taken” is displayed before “total runtime.” Given that it is parallel, the runtime is less than the CPU time.
May 21, 2013 at 2:54 PM
HI

For this kind of cases, i was using the FilterReadUtil, but having 3 bases create a codon, if you have something like: AGTTTGCNC
you wil have: 3 codons: AGT TTG CNC, the last codon would be formed incorrectly, based in the FilterReadUtil approach. even more if the read who has the N or the ambiguos character is erased totally, the data will lose an important part of the final contigs results.

adding tho the simon proposals, i was thinking in another two options:

4- Based in MIRA Algorithm, take advantage of this, and make our implementation of de padena based on de Novo Mira algorithm.

5- I dont know if is possible, make a new padena implementation based in fastQ format files, I think this approach will be more safety in relation a N or ambiguos characters.

Regards,

Leo
Developer
May 21, 2013 at 4:18 PM
Leo, I think you misunderstand the problem, when a sequence contains an "N", that sequence can't be used to make an assembly (Life only uses A, C, G, T).

That said, I should probably add one additional, less lazy way to solve this problem. If a read contains an N at position say 45 of 75, rather than dropping the whole read, we can of course just drop k-mers from that read that create an "N" (so the start and end of the sequence are still used). This is probably better, but since reads with "N"s tend to be low quality at all other bases, and since they are a small (~<5%) fraction of reads, simply filtering them out and reporting the percentage removed seemed like a good first pass at the issue.

PadenaUtil can be easily modified to take FASTQ files if it doesn't already, seems like a great idea! (might be good to do this generally so it takes BAM/SAM/FastA/FastQ).
Coordinator
May 21, 2013 at 5:06 PM
I'm not sure why PadenaUtil was coded to only use FastA; I suspect it was just the primary input format available for data. I agree this would be a good thing and that It would be trivial to have it lookup the proper parser from the file input.

mark
May 21, 2013 at 7:26 PM
Edited May 21, 2013 at 7:26 PM
I have thought in this approach since a long time ago, I think this will be a good approach, i think is now the time of materialize this. i'm gonna tackle this.

Leo.
Coordinator
May 21, 2013 at 8:47 PM
Someone should check the code - I recall it WAS modified to use BAM/SAM...

Possibly those changes were backed out, because they were combined with a change in storage classes that impacted performance - maybe both things were reverted by mistake...?

Simon
Developer
May 22, 2013 at 9:09 PM
It seems to calls the FindParserByName method which can parse FastA,FastQ,Genbank and GFF. Didn't see BAM/SAM in there though, it looks like the last addition was Mark adding GFF files. Mark do you remember adding BAM/SAM?
Coordinator
May 22, 2013 at 9:24 PM
No, this was a guy called Manju - who left some time ago and is no longer involved in the project.

If Leo wants to add this in, that would be great.

Simon
Developer
May 22, 2013 at 9:44 PM
Woops, sorry for the confusion Mark! And I second that this would be a great feature to add Leo!
May 22, 2013 at 11:00 PM
Hi all!

Off course i'm interested in make this, for now i'm finishing the last part of a new functionality for blip, once i finish with this (this friday), i will continue with this, i'm clear with what I should to do for fastQ formats, but with the ,Genbank and GFF i don´t know, if anyone can explain me, i will be gratefull, on the other i have understood the SAM and BAM formats are Sequence Alignment/Map format, not assembly formats, so if anyone can help me with this too. actually the SAM and BAM formatters are Sequence Alignment formats, in accordance to the .NET Bio programming guide.

Thanks and regards,

Leo
Developer
May 22, 2013 at 11:03 PM
Hey Leo, FastQ, Genbank and GFF are already in there, so don't worry about that. You can see how they are loaded and just expand that code to do BAM/SAM, I am 99% certain they implement ISequence, so you can just have them spit that out.
Developer
May 22, 2013 at 11:04 PM
But back to the start of this thread, can I just modify the code to exclude reads with ambiguous bases rather than crash? Any nay votes out there?
Coordinator
May 22, 2013 at 11:09 PM
I like your alternative of excluding parts of the kmer chain where the kmers contain Ns. It seems to me this would conserve the largest amount of data. I assume based on your previous Padena work that you would also check that any kmer chain also consisted of an odd number of kmers. Do you think that would work?
Coordinator
May 23, 2013 at 6:12 AM
Plus one for the filter approach - especially given the lack of confidence in the other bases in the read.
Developer
May 26, 2013 at 4:53 AM
Edited May 26, 2013 at 3:14 PM
Simon, would it be alright by you if we went with the filter approach? Although admittedly it's a bit lazy, I think it really helps because it makes the code a lot easier to read, and because the low quality data at the tail end of the spectrum isn't really worth saving. It actually introduces a lot of noise.

On a related issue, I just found for my VCF parser that the server mode GC does improve performance, but the current PadenaUtil.exe is running in client mode based on the .config file. Mark, or anyone else, I am thinking that is simply because I forgot to check that in, and will probably update the .config soon...
Coordinator
May 27, 2013 at 1:36 PM
I can add the flag to the app.config. I assume it's just the gcServer switch?
Developer
May 27, 2013 at 2:47 PM
Edited May 27, 2013 at 2:50 PM
Excellent, just wanted to check that it was okay, flag is in there now.
Coordinator
May 28, 2013 at 3:13 PM
Nigel - yes, the filter approach is fine by me as well
Coordinator
Jun 10, 2013 at 5:06 PM
Hi Nigel,

Is there any progress on this? It is an open issue we would like to get into 1.1, but we have a very narrow time window for the release now.

Can you please let us know status, or if we should punt this to a future release?

Thanks,

Simon
Developer
Jun 11, 2013 at 6:12 PM
Hi Simon,

So I just found a hornets nest of version control issues while trying to merge this change back in from my branch. It looks like not only is my branch pretty well diverged, but the current code base has most of the earlier changes in PADENA rolled back still. Given that the issue of "N"'s is more a big annoyance than a bug and the need for a stable July release, it might be best to just wait till that code gets more settled before allowing PADENA to remove the N values. I believe Jim is currently working on quality metric output as well with PADENA, and this change can just be incorporated at the same time perhaps. I just created an issue and didn't assign a release to it, seems fine to punt to a future release.

Cheers,
Nigel
Coordinator
Jun 11, 2013 at 9:35 PM
Okay, sounds like a plan. I'll let Rick know, he is coordinating the next release.

Simon'