short read aligner?

Feb 20, 2012 at 12:38 PM

Dear all,

Is there code/command for short read alignment like BWA/SOAP2 etc. in the toolset? Anything planned?

Best,

dong

Coordinator
Feb 20, 2012 at 9:03 PM

Dong, can you give us a specific command line or describe in detail what you are trying to do? Are you interested in a command-line aligner?

Under the FAQ these are the aligners we support -

What alignment algorithms are part of the .NET Bio library and therefore available in the Biology Extension for Excel add-in?

  • Smith-Waterman
  • Needleman-Wunsch
  • Pairwise-Overlap
  • MUMmer
  • NUCmer

    BTW we do have a sample application - Sequence Assembler.

    Happy to help further once we know your exact needs.

    Rick for the .NET Bio team

    Feb 20, 2012 at 10:27 PM

    Hi, Rick,

    Nice to get your reply. What I mean is something for serious work, i.e. aligning gigabytes of FASTQ at least as fast and as accurately as BWA or other real-world aligners, with human as the reference sequence.

    Best,

    dong

    Developer
    Feb 21, 2012 at 3:23 AM
    Edited Feb 21, 2012 at 3:23 AM

    Hi Dong,

    As far as I know, other than those that Rick has mentioned above, .NET Bio currently does not have an application for fast alignment of high-throughput sequencing reads to a reference genome like BWA or SOAP. It does have a tool for de novo assembly: PadeNA.

    Cheers,

    Kevin

    Feb 21, 2012 at 2:45 PM

    Thanks to both of you for the replies.

    Another question: are we going to support VCF/BCF format soon?

    Best,

    dong

    Coordinator
    Feb 21, 2012 at 3:16 PM

    Hi Dong,

    Quickly on the alignment question - I know that people have used .NET Bio for alignment tasks, but they have constructed their own applications to do so, so it is possible. .NET Bio is really intended to be a library of functions and not a set of tools (although there are some tools, as Kevin points out) and so is more useful to the programmer. We could really use an efficient implementation of Burrows-Wheeler - if we had one, it would be comparatively simple to build a BWA-like tool.
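For readers unfamiliar with the idea, here is a minimal Python sketch of what a Burrows-Wheeler based exact-match lookup involves. It is illustrative only and is not .NET Bio code: the function names (`bwt`, `build_fm_index`, `count_occurrences`) are hypothetical, the transform is built naively by sorting rotations, and the occurrence table is stored in full, so it shows the technique rather than the compact, efficient implementation a real aligner would need.

```python
def bwt(text):
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column.

    Assumes '$' does not occur in `text`; it is appended as a sentinel that
    sorts before every other character.
    """
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt_str):
    """Build the two tables used by backward search (hypothetical helper).

    first[c]  : number of characters in the text that sort before c
    occ[c][i] : number of occurrences of c in bwt_str[:i]
    """
    alphabet = sorted(set(bwt_str))
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += bwt_str.count(c)
    occ = {c: [0] * (len(bwt_str) + 1) for c in alphabet}
    for i, ch in enumerate(bwt_str):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return first, occ


def count_occurrences(pattern, first, occ, n):
    """Count exact occurrences of `pattern` by backward search over the index."""
    lo, hi = 0, n  # half-open range of sorted rotations matching the suffix so far
    for ch in reversed(pattern):
        if ch not in first:
            return 0
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


transformed = bwt("banana")
first, occ = build_fm_index(transformed)
print(transformed)                                              # annb$aa
print(count_occurrences("ana", first, occ, len(transformed)))   # 2
```

Each pattern character costs two table lookups regardless of reference length, which is why the approach suits short reads. A production tool would sample the occurrence table and build the transform from a suffix array instead of sorting rotations.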

    Regarding VCF/BCF formats, I am not able to speak for everyone who might add code to the library, but I am not aware of anyone with plans to add this parser. It would be very helpful to the project if you could go to the 'Issue Tracker' tab at the top of this page and add this as a request, with some information such as the version of the format you need, the type of work you need it for, and ideally a link to the format description. This feature request will then be visible to anyone in the community looking for features to add.

    Thanks,

    Simon

     

     

    Feb 22, 2012 at 5:37 PM

    Hi, Simon,

    As suggested, I added the VCF request to the Issue Tracker.

    Regarding short-read aligners, I would consider Illumina's ELAND, along with BGI's SOAP and MAQ, to be the first generation.

    After the BWT (Burrows-Wheeler transform) was introduced, almost everyone used it in their own programs, which gave us SOAP2, BWA, etc.: the second generation.

    But the BWT is not strictly necessary; what it does is reduce the memory footprint needed to hold the reference genome. For human, say, SOAP needs <16GB, while SOAP2 needs only <4GB.

    My thinking is that an aligner needs to be compact and close to the metal, since it is basically querying memory billions of times per run; that's why SOAP3 even went into GPGPU territory. I'm not that confident .NET can compete with C/C++ on this task, even though I'm a .NET advocate myself.

    I would be happy to learn otherwise. If someone has already done something like this, even if they don't want to share their code, just knowing that it's possible would be enough to satisfy me.

    So my points: 1, the BWT is not a must; 2, .NET Bio still needs a fast aligner.

    As a side topic, I'm learning the basics with the SOAP v1 C++ code. I managed to compile it on Windows to run single-threaded, and will make it run multithreaded next. Then on to BWA/SOAP2.

    Best,

    dong

    Feb 29, 2012 at 10:53 PM

    Update: SOAP v1 now runs on x64 Windows; I wonder if anyone is interested.

    (BGI's website of SOAP v1: http://soap.genomics.org.cn/soap1/)

    Although it is first generation and does not use the BWT, it is still fast and capable of real work.

    Best,

    dong

    Coordinator
    Mar 7, 2012 at 4:59 PM

    Hi Dong,

    We are looking at adding VCF format to .NET Bio as you requested. In order to do so, we need your help.

    • Could you please send us some example data files? We need real-world data to test our parser. 5-10 relatively small VCF files would be ideal; they would need to contain the full range of features you need in your work, so we can make sure we parse them.
    • Do you have a preference for the version of VCF you would like us to support?
    • Can you tell us a little about the way you plan to use the data? VCF format can contain a great deal of information, and we may not initially support parsing of all fields. If you can let us know which parts of the file you need, for example by describing how you use VCF now, what you use it for and which pieces of information you need, it will help us write a more useful parser.

    Thanks for your help; please attach test files to the issue you created in the Issue Tracker; if they are too large let me know and we'll see if we can arrange some alternative.

    Thanks,

    Simon

     

    Mar 8, 2012 at 2:03 PM

    Thanks, Simon, for the quick response on the VCF request.

    There are plenty of these files on the 1000G project FTP server:

    ftp://ftp.1000genomes.ebi.ac.uk/

    To be specific the file I'm looking at is this one on the FTP server:

    ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/snps/CHBJPT.low_coverage.2010_09.sites.vcf.gz

    (regarding the related tbi file:

    ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/snps/CHBJPT.low_coverage.2010_09.sites.vcf.gz.tbi

    I'm not sure about this yet; TBI is yet another file format)

    Regarding the version, I suppose the most recent will be better, i.e. v4.1.

    The intended use is to read SNP/short-indel information from publicly available data, and to write VCF from .NET Bio code or a wrapper.

     

    Best,

    dong

    Coordinator
    Mar 13, 2012 at 8:02 PM

    Dong, the VCF format can be a bit tricky, so I'd like you to be as clear and detailed as possible in your reply. If we can get specific details down, I think what you want would be possible in our .NET Bio 1.01 release, but any delay or misunderstanding will put this in jeopardy of not getting done.

    What is the exact intent of your use of this parser?

    For instance, after reading the file, do you want to be able to generate a full sequence, or do you just want to parse the data/file to see the differences? Please give as full and detailed an answer as possible.

    Next, the version 4.0 format seems to be the more widely used and has the most samples. The 4.1 format seems to be fairly recent; in fact, the only parser that seems to read it at this point is a C/C++ one. I could not even find a Java version. What specifically in the 4.1 format do you use or need that is not available in the 4.0 format?

    The 4.0 version has subtle differences like noting the title in the file format whereas 4.1 can be a link without even an extension.

    Coordinator
    Mar 13, 2012 at 9:25 PM

    Hi Dong,

    I'm interested in this as well - as a contributor on the project.

    .NET Bio supports reading SNPs using a tab-separated format with the Bio.IO.Snp.SimpleSnpParser.  This returns a SparseSequence for each chromosome with either the first or second allele in each position (controlled by a property of the parser).  This approach supports replacements, but not really insertion/deletions or mixed records as VCF does.  It also doesn't provide for generating the alternative sequence - you get the deltas, but not the reference with deltas applied (although it's not terribly hard to do that).  It's bare-bones support today.
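As a rough illustration of "the reference with deltas applied", here is a small Python sketch for the substitution-only case. It does not use or mirror the Bio.IO.Snp API: `apply_snps` and its input shape (a dict of 0-based position to reference/alternate base) are assumptions made for the example, and indels are deliberately out of scope because they shift coordinates.

```python
def apply_snps(reference, snps):
    """Return `reference` with single-base substitutions applied.

    `snps` maps a 0-based position to a (ref_base, alt_base) pair. This input
    shape is hypothetical and chosen only for the example. Each record is
    checked against the reference before it is applied, so a SNP list built
    against a different reference fails loudly.
    """
    bases = list(reference)
    for pos, (ref_base, alt_base) in snps.items():
        if bases[pos] != ref_base:
            raise ValueError(
                f"position {pos}: expected {ref_base!r}, found {bases[pos]!r}"
            )
        bases[pos] = alt_base
    return "".join(bases)


# Made-up toy data, not taken from any real file.
reference = "ACGTACGT"
snps = {1: ("C", "T"), 6: ("G", "A")}
print(apply_snps(reference, snps))  # ATGTACAT
```

Because substitutions never change the sequence length, the records can be applied in any order. Supporting insertions and deletions would need the records sorted and applied from the highest position down, or an explicit coordinate map.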

    I could see .NET Bio supporting VCF in order to generate a set of SNPs and indels from the VCF input file and then you would consume them - similar to what we do today with the SnpParser.  Is that what you are thinking? Or, as Rick suggested, are you looking to supply BOTH the VCF and reference sequence (FASTA, etc.) and have the parser generate a set of sequences with the SNPs applied?  I'm thinking you want the former based on your comments.

    You mentioned being able to write out in VCF as well.  .NET Bio doesn't currently have a mechanism to emit SNP records (there's no SnpFormatter).  Given that, this is a new area of development, so what would be your use-case to generate a VCF file? What kind of input are you thinking of supplying - multiple sequences to delta, or are you thinking you'd perform the delta yourself and then provide some type of data structure in the form of SNPs/indels?

    Thanks for your clarifications --

    mark

    Mar 16, 2012 at 3:37 PM

    Dear Rick and Mark,

    Sorry for the delay. Well, you keep asking, so I'm forced to do some homework on this. :)

    My understanding: VCF 4.0 is already good enough to hold SNPs and indels, but there is also an SV (structural variant) extension (described on two webpages:

    http://www.1000genomes.org/node/101

    http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/VCF%20%28Variant%20Call%20Format%29%20version%204.0/encoding-structural-variants

    )

    This extension basically deals with ##ALT=<ID=type,Description=description> declarations, e.g. <DEL>, so one no longer needs to put a long reference sequence in the file.

    In 4.1 this extension is part of the standard, and they add yet another part to deal with "Complex Rearrangement", which is the last part of the webpage, with the fancy graphics. I suggest we just ignore this; I can't see this 'Complex' stuff leading anywhere (my point is that it is over-developed).

    So 4.0 + sv_ext is enough. Frankly, I have no idea yet what .NET Bio should do with VCF. Given that it's just plain text, I could always parse the header and read row by row to find what I want. Also, given that .NET Bio is object-based, I don't know which objects the parser you are working on should emit from the files, or which objects need to be persisted into them.
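The "parse the header and read row by row" approach could look roughly like the Python sketch below. It assumes the standard layout of a VCF 4.0 sites file (`##` meta lines, one `#CHROM` column line, then tab-separated records) and ignores genotype columns. `parse_vcf` is a hypothetical helper, not part of any library, and the two sample records are made up for illustration.

```python
import io


def parse_vcf(handle):
    """Read a VCF text stream into (meta_lines, column_names, records).

    Each record is a dict keyed by column name. POS is converted to int,
    ALT is split on commas, and INFO becomes a dict in which flag entries
    without a value map to True.
    """
    meta, columns, records = [], None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.startswith("##"):
            meta.append(line[2:])             # e.g. fileformat=VCFv4.0, ALT=<...>
        elif line.startswith("#"):
            columns = line[1:].split("\t")    # CHROM, POS, ID, REF, ALT, ...
        else:
            if columns is None:
                raise ValueError("data line found before the #CHROM header line")
            record = dict(zip(columns, line.split("\t")))
            record["POS"] = int(record["POS"])
            record["ALT"] = record["ALT"].split(",")
            info = {}
            if record["INFO"] != ".":         # "." means no INFO entries
                for item in record["INFO"].split(";"):
                    key, _, value = item.partition("=")
                    info[key] = value if value else True
            record["INFO"] = info
            records.append(record)
    return meta, columns, records


# Made-up sample: one SNP and one symbolic <DEL> allele from the SV extension.
sample = "\n".join([
    "##fileformat=VCFv4.0",
    '##ALT=<ID=DEL,Description="Deletion">',
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["1", "1000", ".", "T", "C", ".", "PASS", "DP=118;HM2"]),
    "\t".join(["1", "2000", ".", "C", "<DEL>", ".", "PASS", "SVTYPE=DEL;END=2300"]),
]) + "\n"

meta, columns, records = parse_vcf(io.StringIO(sample))
for rec in records:
    print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"])
```

Plain dicts sidestep the question raised above about which objects a parser should emit. A library parser would have to settle on a typed model, which is the harder design problem.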

    To rephrase my request: I think what I want is something like VCF-TOOLS and BCF-TOOLS, so that you could convert between formats and do some operations on these files.

    Maybe we could have some online discussion/sharing over Lync?

    Best,

    dong

    Coordinator
    Mar 16, 2012 at 4:52 PM

    To rephrase my request: I think what I want is something like VCF-TOOLS and BCF-TOOLS, so that you could convert between formats and do some operations on these files.


    What format(s) would you want to convert to/from?

    mark

    Mar 16, 2012 at 4:57 PM

    VCF to BCF and vice versa.

    Coordinator
    Mar 20, 2012 at 9:58 PM

    Hi Dong,

    I'm a contributor on the project and have done some work building parsers for .NET Bio - specifically to support some different sequence formats.  I'm guessing you'd like to maintain all the information in both formats as you round-trip them. Personally, I think it would be easier to port the bcftools command line to Windows for your specific requirements (it's been done before, and the source code has Win32 headers, but the last version I see in binary form is 0.1.12 - see: http://sourceforge.net/projects/samtools/files/samtools/0.1.12/).

    I saw Rick from Microsoft added a work request to support it, but it wouldn't do a full-fidelity transfer - there are no data structures in .NET Bio to hold all the information contained in a VCF file today so it would require design work and really would need some real project to be used for testing and to shake out the bugs.  For just a translation between the formats it probably wouldn't be worth the effort.

    I say this because .NET Bio is really geared toward manipulating the sequences; reading and writing them is only a small part of that effort.  You can certainly use .NET Bio to perform data conversions (and I personally have), but that's such a small piece of the functionality provided.  Adding VCF/BCF formatters and parsers just to do conversion would be a pretty big project to take on because it would require building a full-fidelity in-memory model to pass between the input and output side.

    Have you tried compiling samtools under Windows?  Maybe you and I could chat offline; I'd bet it wouldn't be hard to get 0.1.18 to compile and run.  Shoot me an email: mark "at" julmar "dot" com.

    mark