Suggested formats to include in the next .NET Bio version

Jul 23, 2013 at 8:31 PM
Dear all

I'm looking for interesting new formats to include in the next version of the library. So, a question for the whole community: which formats do you think would be desirable to include?

If you want to see the existing formats, you can find them in the parsers and formatters section of the NET Bio_getting_Started_Guide.

Leo
Developer
Jul 23, 2013 at 11:41 PM
Hi Leo,

Are you looking for any kind of format parser? I have some code that does VCF parsing; it's described a bit more here:

http://evolvedmicrobe.com/blogs/?p=71

I have been meaning to integrate it into .NET Bio, but haven't been able to find the time recently. The code is largely good to go, but a few things need to be done.

1 - Comments, etc. need to be added so that it conforms to the .NET Bio specifications and is usable; examples and unit tests would also be good.

2 - Make it faster than the Java version, which right now it isn't, and that sort of annoys me. However, this should be accomplished reasonably quickly by following the comments listed at the end of the blog post.

Some extras that might be nice before contributing it would be:

1 - Create the ability to write files in addition to reading them.

2 - Further "Sharpen" the code, for example by using lazy classes in C# and cleaning out a few more layers of abstraction carried over from the Java.
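
To make the lazy idea in item 2 concrete, here is a minimal sketch, not the actual parser code. It is written in Python for brevity rather than C#, where something like Lazy<T> would presumably play the same role. The class name VcfRecord and its attributes are hypothetical; the only assumption about VCF is the standard layout of eight tab-separated fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO).

```python
from functools import cached_property


class VcfRecord:
    """One VCF data line; the INFO column is parsed only on first access."""

    def __init__(self, line: str):
        # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO
        fields = line.rstrip("\n").split("\t")
        self.chrom = fields[0]
        self.pos = int(fields[1])
        self.ref = fields[3]
        self.alt = fields[4].split(",")
        self._raw_info = fields[7]  # kept as raw text until it is needed

    @cached_property
    def info(self) -> dict:
        """Parse 'KEY=VALUE;FLAG' entries lazily and cache the result."""
        parsed = {}
        if self._raw_info == ".":  # '.' marks a missing INFO column
            return parsed
        for item in self._raw_info.split(";"):
            key, sep, value = item.partition("=")
            parsed[key] = value if sep else True  # flags carry no value
        return parsed


record = VcfRecord("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB")
print(record.chrom, record.pos, record.info["DP"])
```

The INFO column is only split the first time record.info is read, so a scan that filters on chromosome and position alone never pays for parsing it.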

Let me know if you have any interest in working on this. I can put the code on GitHub so we can work on it there and then add it to the trunk.

Cheers,
N
Coordinator
Jul 24, 2013 at 6:18 PM
I would strongly support the addition of VCF - Dong Xie, who used to be active on this forum, asked for this some time ago. Also, a great many genomics pipelines reconstruct a genome for the purpose of variant calling and then use the variants downstream, for example in GWAS. VCF appears to be the de facto standard.

Another possibility would be FASTG, the new format supported by the Assemblathon. It would be nice to be able to parse this new format, for once stepping beyond what everyone else can do and being among the first to adopt it.

A further option would be PDB format, to start moving in the direction of proteomics. While this would be useful in any case, it would be better still to also support some functionality downstream of file parsing, so that .NET Bio could offer more for proteomics.

Simon