VCF-Parser - Integration

Developer
Apr 3, 2014 at 8:27 AM
Hey,

I'm Thomas, the exchange student Jim was always talking about.
I set up my environment for the integration of the VCF-Parser and I know how to access the VCF-Files and get the data.

As the integration area, I thought about the IParser and ISequenceParser interfaces, but the ISequenceParser interface doesn't seem to fit the purposes of the VCF file format.
As I'm not a bioengineer, we discussed this today: the VCF file format seems to hold metadata about the chromosome rather than sequences as in e.g. GenBank files (which is what ISequenceParser expects as output).

Is there another interface/namespace that better fits the VCF-FileFormat? Or do I get something wrong?
Developer
Apr 3, 2014 at 7:06 PM
Hi Thomas,

I don't think ISequenceParser makes sense either. It appears the XSV file parser in the library derives from it somewhere, but that seems like a bad idea.

It is not immediately clear to me how to make a general variant interface. Broadly, variant file formats can be divided into those that only report on biallelic sites (basically SNPs, as in the XSV or PED file formats for plink, eigenstrat, etc.) and those that handle more complex variation (like VCF files, which have indels, multiple alleles, etc.). Easily working with both types of files would be great, and you could check out the various specs to decide what a nice overall interface to implement might be. If doing this, ideally both the parser and the things returned by the parser would implement the same interface (e.g. an IVariantContext or something).
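Just to make the idea concrete, a very rough sketch of what I mean (nothing like this exists in the library yet; all names are placeholders):

using System.Collections.Generic;

// Hypothetical sketch only - not an existing type.
public interface IVariantContext
{
    string Chromosome { get; }
    int Start { get; }
    IList<string> Alleles { get; }  // a biallelic format would always have two
}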

However, I would say you are also fine just implementing IParser for now though, as the most important thing I think will be getting a working parser and supporting bcf if possible.

Cheers,
Nigel
Developer
Apr 4, 2014 at 7:36 AM
Hey Nigel,

but implementing IParser doesn't provide a way of reading files, because IParser "only" holds the basic information like "Name", "Description" and "FileType" - it's the next interface deriving from IParser (e.g. ISequenceAlignmentParser, ISequenceRangeParser, ISequenceParser) that provides the methods for opening the file and parsing it. It has to be done like this because the "Parse()" method returns different values depending on the data structure in the file.
So I was wondering if there is already such a "second level" interface that fits the VCF data, or if I have to define one.

If I have to define one, I think it should have the following methods:
  • void Open(string path)
  • IEnumerable<VariantContext> Parse()
  • IEnumerable<VariantContext> Parse(StreamReader reader)
  • void Close()
(This structure is based on the other interfaces.)

What do you think?

Cheers,
Thomas
Coordinator
Apr 4, 2014 at 1:46 PM
Hi Thomas,

I think we need to refine the interfaces some. It's on the plan for this release sometime in June.

The IParser interface is really just a way of identifying parsers generally by name - but there's no requirement on the parser itself (i.e. for how it gets data, what it parses to, etc.). ISequenceParser was really all about sequences. I think we need something in between - an IParser<T>. My "in-memory" definition looks like:

interface IParser<T> : IParser
{
    IEnumerable<T> Parse(StreamReader reader);
}

I'd leave open/close/parse(void) off - they are unnecessary when you have Parse(StreamReader). That's actually a much better way since it's platform-agnostic. The problem with open(string) is that you can't define a cross-platform way to implement it. Our next release will target Mono/mobile/Metro and Desktop in one magical set of assemblies so we will be refactoring things to allow us to more easily accomplish that goal. My plan is to remove things like Open(string) off the interface itself and instead move them to extension methods specific to the platform - so the platform decides "how" to open a file and then passes the stream into the cross-platform parser code - i.e.:

static class IParserExtensions
{
    // Sketch only - bodies omitted.
    public static void Open<T>(this IParser<T> p, string filename);
    public static void Close<T>(this IParser<T> p);
    public static IEnumerable<T> Parse<T>(this IParser<T> p);
}

But, we're not there yet - so here's my suggestion:

Implement the parser with your current plan - create an interface derived from IParser, get the parser running with samples and unit tests, and then we'll migrate it to the newer architecture from there - that way we aren't testing two things simultaneously (fewer moving parts). How does that sound?

mark
Developer
Apr 7, 2014 at 7:37 AM
Hey,

the redefinition of the IParser interface with generics sounds very reasonable to me, and given the constraints of the cross-platform implementation it might really be useful to have this "outsourced" to the platform-specific code.

So I started implementing the parser interface for reading VCF files, along with its implementation.
I will sit down with Lawrence tomorrow to set up the integration into the FoundationServer project.

Cheers,
Thomas
Developer
Apr 8, 2014 at 6:52 AM
Hey,

I looked at the other implementations of the parser interfaces (like ISequenceAlignmentParser, ISequenceRangeParser or ISequenceParser), and it appears to me that they don't have a common "structure".

Some have an "Open()" method and some just have "Parse(string filename)" methods.
Some implement "Parse(TextReader)", some "Parse(StreamReader)".
Some use "IEnumerable", some "IList" as the return type.

Mark, you said already that this will be restructured later this year.
But what methods should I implement to make this restructuring as easy as possible?

Therefore some questions:
  • What should the "Open()" method do? Does it just check that the file exists, or does it already start reading the header of the file (the latter would be more convenient for me)?
  • Do we need a "ParseLine()" method? Or is it enough to just have a list of all VariantContext elements?
  • Do we need a "Parse(StreamReader)" method, given that the StreamReader is already created in the Open method? If we need it, where does the StreamReader start - at the header or at the beginning of the data to be parsed?
I ended up with this interface idea:
public interface ISequenceVariantContextParser : IParser, IDisposable
{
    // sets the filename for the parser and reads the header
    void Open(string filename);

    // read one line of the data
    VariantContext ParseLine();

    // read all data, closes the stream automatically
    IEnumerable<VariantContext> Parse();

    // close the stream
    void Close();
}
or, more simply:
public interface ISequenceVariantContextParser : IParser, IDisposable
{
    // sets the filename for the parser, reads the header, all data and closes the stream
    IEnumerable<VariantContext> Parse(string filename);
}
But this doesn't always correspond with the existing implementations, and it doesn't use the StreamReader, which would be useful for the further steps...

Cheers, Tom
Coordinator
Apr 8, 2014 at 12:57 PM
I would have thought that
  Parse( TextReader reader ) 
would be both more flexible and platform neutral than either
  Open( string filename )/Parse() 
or
  Parse( StreamReader reader ).
Use of TextReader allows us to conveniently parse a wider range of Readers connected to strings, byte arrays and memory streams as well as files. This form used to exist in the earlier (MBF) versions of the genbank parser but it disappeared as of v1.0, and I have missed it.
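
For instance, all of these produce a TextReader that the same Parse() overload could consume (the file name and contents are just examples):

using System.IO;

TextReader fromString = new StringReader("##fileformat=VCFv4.1");
TextReader fromFile   = new StreamReader("variants.vcf");
TextReader fromBytes  = new StreamReader(new MemoryStream(File.ReadAllBytes("variants.vcf")));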

However, if there is some reason that I cannot see (does the reader need to support seek or something?) to avoid TextReader, then StreamReader is the better of the two options, because it does not presuppose that the sequence resides in a file. I seem to find them lying around in strings quite frequently...

Cheers, Lawrence.
Coordinator
Apr 8, 2014 at 1:27 PM
Edited Apr 8, 2014 at 1:29 PM
I'd like to just have Parse on the interfaces. Provide open/close semantics along with dispose (for close w/using) on the class, but the interface should be as loose as possible to allow for a variety of mediums (think network, db, strings, etc.). So, I'm thinking:
public interface ISequenceVariantContextParser : IParser, IDisposable
{
    // Reads each element from the passed reader.
    IEnumerable<VariantContext> Parse(TextReader reader);
}
Then your implementation can do this:
public class SequenceVariantContextParser : ISequenceVariantContextParser
{
    public string ID { get; set; } // from IParser
    ...

    // These are just part of the class - no interface requirements.
    public void Open(string filename);
    public void Close();
    public void Dispose();
    public IEnumerable<VariantContext> Parse();  // assumes Open was used, forwards call to below:

    // This is IParser<T> (future), for now ISequenceVariantContextParser
    public IEnumerable<VariantContext> Parse(TextReader reader);  // all in one - no need to call Open.
}
This closes a current ambiguity in the library - when you get a parser from SequenceParsers.FindParserByFilename, it OPENS the parser for you. But if you use SequenceParsers.All to find one, it does not. We could potentially remove the Open/Close/Parse once the interface is refactored and move them to extension methods for all IParser<T> types. The ISequenceVariantContextParser will disappear and get replaced by IParser<T>.

Does that make sense?

mark
Coordinator
Apr 8, 2014 at 1:33 PM
This makes excellent sense Mark, thanks.

Lawrence.
Developer
Apr 9, 2014 at 4:01 AM
Alright, that makes sense to me as well.

The method
public IEnumerable<VariantContext> Parse(TextReader reader);
has to read the header of the file as well, doesn't it? That means the TextReader is positioned at the beginning of the file, not somewhere in between...
Coordinator
Apr 9, 2014 at 4:05 AM
It should be OK to stipulate that the TextReader is positioned "ready to read" the data as part of your documentation. That would be considered pretty normal.
Developer
Apr 9, 2014 at 4:38 AM
Hi all,

So I am not sure that using the TextReader class as the argument in the interface is the best idea. For example, our BAM parser has this adorable bit of code in it as the "implementation" of its interface:
IList<ISequenceAlignment> ISequenceAlignmentParser.Parse(TextReader reader)
{
    throw new NotSupportedException(Properties.Resource.BAM_TextreaderNotSupportedMessage);
}
And I think we might be setting ourselves up for a lot of "not implemented" exceptions in future development if we go with text readers here. When discussing an interface, we need to be aware that even if in this release we are only using text files, in practice VCF files come in three common forms:

1 - A text file
2 - A gzipped text file
3 - A gzipped BCF file (a binary file)

What would make the library great (and useful) is if we set things up so that the caller's code works the same regardless of which of these file types is passed to the class - the class can just figure it out.

Right now the parser works on text files, but realistically compressed files will be much more common in the future. A standard VCF file, the dbSNP file for b37, is 11 GB in size, and these files are only going to get larger. Realistically, they should already be compressed, but they aren't. In my humble opinion this is because the linux command line tool kit is not well equipped to deal with fancy scenarios like this (running awk on block-compressed files is difficult, calling an agnostic C# enumerator is not), so these inefficiencies stick around. However, I think our library should be ready to handle compressed data, or data over a network, etc.

I think if we went with a TextReader, dealing with compressed data becomes much harder, so StreamReader and/or a string argument might be the easiest. However, I won't claim to know the answer here; I just wanted to point out that we should prepare for the future of compressed file formats with anything we do here. Also, while most of the discussion so far has been about sequence parsers, FASTA and GenBank files are very different from VCF files, which are in a league of their own with PLINK files, array data, etc. Right now this is our only variant context parser, but these other formats make heavy use of compressed formats that require unpacking bits, so realistically they won't typically come in as text files.

So does StreamReader, string or something else seem like a better type to pass as the argument?

-Nigel
Developer
Apr 9, 2014 at 5:38 AM
For now, I'll just use the TextReader to continue with my implementation.
Changing this to a StreamReader is not going to be a problem, as I use a StreamReader internally anyway...
And I'll follow this discussion in case it changes ;-)

Passing a string would require changing the code a bit, because the file is read line by line (not much, but this has to be considered for the other parsers as well).
Developer
Apr 9, 2014 at 5:47 AM
Whoops, misspoke (mistyped, I suppose) in the last comment. StreamReader inherits from TextReader, so there is no difference there. I would still love to implement with future compression in mind, though. FileStream (used in BAM) inherits from Stream, which might be the most general option we could pick; converting from a TextReader to this might be a problem...
Coordinator
Apr 9, 2014 at 6:02 AM
Stream may be the best bet in the long term, and also for Thomas' project.
Files can be opened as a FileStream; strings or byte arrays can be wrapped in a MemoryStream. If an implementation really wants text, it can wrap a StreamReader around the Stream.
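For example (file name and contents are placeholders, just to illustrate the wrappings):

using System.IO;
using System.Text;

Stream fromFile   = File.OpenRead("variants.vcf");
Stream fromString = new MemoryStream(Encoding.UTF8.GetBytes("##fileformat=VCFv4.1"));

// An implementation that really wants text just wraps the stream:
using (var reader = new StreamReader(fromFile))
{
    string firstLine = reader.ReadLine();
}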
With regard to the multiple packing forms for VCF, I'm not sure how easy it is to sniff out the difference between gzipped text and binary if the VCF reader tries to handle all three formats. If the Stream does not support seek and we have to consume part of it to identify the content type, we would be stuck.
Both text and gzipped text can be handled by the same parser, provided that a suitable stream is created. The gzipped BCF form might need a separate parser.
Developer
Apr 9, 2014 at 6:57 AM
Edited Apr 9, 2014 at 6:58 AM
Agreed that seamlessly switching between the different VCF forms could be tough; right now, for instance, the BAM/SAM parsers are implemented separately, and given that the data produced is so similar I wish this weren't the case. In any event, I am more than happy if we stick with Stream, as I think this will be plenty flexible for the future.

I think that if in later iterations we try to sniff out the VCF type, it will be a pain but might not be too hard. To produce a variant context one has to consume the header anyway. BCF files begin with the magic string "BCFXX", so they are readily identifiable, and standard VCF also has a canonical first line (##fileformat=VCFvXX). True, if we can't seek on the stream we are in trouble, but if the file is 11 GB and we can't seek we have problems anyway, so I expect in most use cases this won't be problematic.

I just wanted to make sure we put ourselves in a good position to implement other variant file formats with a similar interface, and/or give ourselves the option to index and query these types of files using the tabix approach. We have about 75% of a TABIX implementation with the BGZF format and have already solved all the hard/buggy stuff in the BAM parser. At some point I might try to refactor this out to make it more transparent and extend it to everything; in the meantime, I think this won't make it any more difficult!

Thanks for all comments,
N
Coordinator
Apr 9, 2014 at 12:25 PM
I'm ok with Stream too, it just lowers the bar quite a bit because Stream is pretty low level. If we decide to use Stream as the interface input, I think we'll want a base abstract Parser<T> class which has some helpers to verify the stream is readable and has data. We might also want to think about encoding styles, particularly for cross-platform.

For compressed streams, I wonder if we could somehow layer the parsers - i.e. detect (based on extension) a gzipped file and use the decorator pattern to layer parsers together so the final parser only deals with uncompressed data. Nigel - you've worked with these the most, what's your take?
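
Something like this, perhaps - a rough sketch of the extension-based layering (the method name is mine, not anything existing):

using System;
using System.IO;
using System.IO.Compression;

public static class StreamLayering
{
    // Choose the stream wrapping by extension, so the actual parser
    // only ever sees uncompressed data.
    public static Stream OpenVcfStream(string filename)
    {
        Stream raw = File.OpenRead(filename);
        return filename.EndsWith(".gz", StringComparison.OrdinalIgnoreCase)
            ? new GZipStream(raw, CompressionMode.Decompress)
            : raw;
    }
}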

mark
Coordinator
Apr 10, 2014 at 4:54 PM
What a great thread - I'll have to go on vacation more often :-)

Simon
Developer
Apr 10, 2014 at 5:16 PM
Hi All,

I think the decorator pattern could work well to wrap the stream at various levels, and I think layering the parsers is a great idea. One of the problems with the BAMParser class is that it is really confusing to read, because the logic for handling the decompression is mixed in with the logic for manipulating the uncompressed data, and I just wanted to avoid doing that again. Anyway, I'm happy to see it implemented either way.

Seems like we have a consensus on using Stream, and I would vote for an overload that takes a string (e.g. a filename) and constructs a stream from it. I think we should have the general parser check the first few bytes of the stream for either the gzip file header (0x1f, 0x8b), the BCF magic ID ("BCF") or the VCF header ("##fileformat"), and return an IEnumerable<VariantContext> from whichever parser can appropriately handle it. Internally, this would probably be one class that parses a text stream (either compressed or not, which we have now) and another class that parses BCF files (which we don't have yet, but hopefully will later).
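
In code, the detection might look something like this (the enum and class names are mine, not existing library types, and it assumes the incoming stream is seekable):

using System.IO;
using System.IO.Compression;

public enum VcfForm { Text, GzippedText, Bcf }

public static class VcfFormSniffer
{
    public static VcfForm Detect(Stream stream)
    {
        var magic = new byte[3];
        stream.Read(magic, 0, 3);
        stream.Seek(0, SeekOrigin.Begin);

        if (magic[0] != 0x1f || magic[1] != 0x8b)   // no gzip header
            return VcfForm.Text;                    // expect "##fileformat=..."

        // Gzipped: peek inside to tell BCF ("BCF...") from gzipped text.
        using (var gz = new GZipStream(stream, CompressionMode.Decompress, true))
        {
            var inner = new byte[3];
            gz.Read(inner, 0, 3);
            stream.Seek(0, SeekOrigin.Begin);
            return (inner[0] == 'B' && inner[1] == 'C' && inner[2] == 'F')
                ? VcfForm.Bcf : VcfForm.GzippedText;
        }
    }
}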

Cheers,
Nigel
Developer
May 8, 2014 at 6:27 AM
To test whether the integration of the parser worked, I started writing unit tests.
But I'm struggling to find VCF files. Does anybody have some?
Developer
May 8, 2014 at 1:12 PM
Developer
May 29, 2014 at 1:54 AM
Edited May 29, 2014 at 4:28 AM
Hey Nigel,

what version does your VCF parser support?
I tried using the parser with files from the BroadResourceBundle (which are version VCFv4.1).

For example, for the file "NA12878.knowledgebase.snapshot.20131119.b36.vcf", I keep getting an error that it tries to add the value "GATKCommandLine" twice to the dictionary.

Could you give me a file you tested the parser with, so that I have a reference that should work?

Cheers, Thomas

PS:
I tried to debug a bit more, and it seems that the parser - while reading the header - adds the first "GATKCommandLine" as expected.
After that, it tries to add another "GATKCommandLine" from the source file and obviously can't, because the key already exists.
If I delete one of the "GATKCommandLine" lines in the document, it can read the header with one command line.
The problem occurs in "VCFHeader::loadMetaDataMaps()", where "GATKCommandLine" is handled in the last "else" path.
Developer
May 29, 2014 at 4:15 PM
Hi Thomas,

So, I just ran that test file through my parser and it seems to be working fine. If I grep, GATKCommandLine only appears once, so it shouldn't be added twice to begin with. I just pushed that file to github, as well as an example parsing it; can you please describe in a bit more detail how you obtained this error?

My example files are probably too large to transmit easily, but it matters much more that the 1000 Genomes files work anyway.

I support v4.1. The GATK emits 4.1, so I think that is a decent goal to get started, but if you felt like implementing 4.2 as well, that would be awesome. I haven't seen v4.0 in years, so I wouldn't worry about it.

Cheers,
Nigel
Coordinator
May 29, 2014 at 11:10 PM
Hi Nigel,
Thanks for that. A quick look here: http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 says that the format is not yet finalised, but I assume it is the best guide. Thomas, can you look at this and scope any additional work?
Developer
Jun 17, 2014 at 10:33 AM
Hey,

sorry for the late response, I had exams the last two weeks and was busy studying.

Concerning the file format: I'll test the code with the v4.1 file format as my test files from the BroadResourceBundle are in v4.1.

I had one problem with my testing procedures that I fixed, but I still keep getting exceptions for some files (also with your test-implementation):
The files are the following:
  • 1000G_omni2.5.b37.vcf (problem with key: reference)
  • 1000G_phase1.indels.b37.vcf (problem with key: reference)
  • CEUTrio.HiSeq.WGS.b37.bestPractices.b36.vcf (problem with key: GATKCommandLine)
  • CEUTrio.HiSeq.WGS.b37.bestPractices.b37.vcf (problem with key: GATKCommandLine)
  • dbsnp_138.b36.excluding_sites_after_129.vcf (problem with key: GATKCommandLine)
  • dbsnp_138.b37.excluding_sites_after_129.vcf (problem with key: GATKCommandLine)
  • NA12878.knowledgebase.snapshot.20131119.b36.vcf (problem with key: GATKCommandLine)
The error appears in OrderedGenericDictionary.cs, line 26, with the error "ArgumentException was unhandled. An element with the same key has already been added."

The reason, I figured out, is that the program doesn't expect two (or more) entries of the "reference" or "GATKCommandLine" parameters.
Developer
Jun 17, 2014 at 1:49 PM
Can you grep the files to see how many times that line appears? And upload a reproducible error case to github? I can't explain how we are winding up with different behavior.
Developer
Jun 18, 2014 at 7:30 AM
I have the files, and the file "NA12878.knowledgebase.snapshot.20131119.b36.vcf", for example, contains the following lines:

[...]
GATKCommandLine=<ID=ExtractConsensusSites,Version=2.7-63-g8d09086,Date="Tue Nov 19 10:27:42 EST [...]
GATKCommandLine=<ID=FilterLiftedVariants,Version=2.8-1-g2a26ec9,Date="Fri Dec 06 14:42:10 EST [...]
[...]


So, the GATKCommandLine appears twice in the file.

Unfortunately I can't upload anything to github ("Permission to evolvedmicrobe/Bio.VCF.git denied").
The only line I added to your code was fname = "testData/NA12878.knowledgebase.snapshot.20131119.b36.vcf";
Developer
Jun 18, 2014 at 7:39 AM
Coordinator
Jun 19, 2014 at 1:39 PM
Edited Jun 19, 2014 at 1:45 PM
Hi Thomas,

Where is the VCF parser code itself? I don't see it in the source tree yet...

EDIT: nevermind, I found it in the branch.
Developer
Jun 19, 2014 at 2:29 PM
I worked on the parser offline so far, so I haven't checked anything in yet.
So there should be nothing in the branch yet?!

The current progress:
  1. I fixed the "doubled dictionary key" problem by storing the header information "source", "GATKCommandLine" and "reference" as a vector instead of a single dictionary entry (see the sketch after this list).
  2. Another problem popped up, where it says "MemoryException" - I'm still figuring out why this happens.
  3. To import the parser code, I had to change three settings. To finally include the VCF parser, these settings either have to stay changed or the underlying problems fixed.
    Treat warnings as errors: None
    Allow unsafe code
    Code Analysis switched off
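For point 1, the change looks roughly like this (field and method names here are illustrative, not the exact code):

using System.Collections.Generic;

// Repeated header keys ("source", "GATKCommandLine", "reference") now go
// into a list per key instead of a single dictionary slot.
private readonly Dictionary<string, List<string>> metaData =
    new Dictionary<string, List<string>>();

private void AddMetaDataLine(string key, string value)
{
    List<string> values;
    if (!metaData.TryGetValue(key, out values))
    {
        values = new List<string>();
        metaData[key] = values;
    }
    values.Add(value);   // a second "GATKCommandLine" simply appends
}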
How shall I continue with this? Can I commit it into the v2.0 branch?
Coordinator
Jun 19, 2014 at 2:50 PM
Sure, that would be great! It's actually a fork right now:

https://bio.codeplex.com/SourceControl/network/forks/markjulmar/version2?branch=biov2
Developer
Jun 19, 2014 at 8:52 PM
Hi Thomas,

Yep, that double entry fails. It looks like it was a silent problem in the original parser, but it slipped through because the duplicate key was just being overwritten (see below).
https://github.com/samtools/htsjdk/issues/43
Sounds like you have a solution already, which is great.

In regards to the unsafe code, we should almost certainly take that out... it not only can create issues when someone tries to run code without permission, but on the Microsoft runtime it is probably not much of a benefit. In fact, looking over those methods now, I notice they are probably not as efficient as I hoped either (for example, SplitByCharacters allocates an int[] on the heap that can probably be avoided).

Cheers,
Nigel
Developer
Jun 20, 2014 at 2:41 AM
How can I commit into the v2.0 branch? Using Git?
What credentials do I have to use to commit my code?
Developer
Jun 20, 2014 at 2:51 AM
Edited Jun 20, 2014 at 2:52 AM
Thomas, on github are you phanthomas09?
Developer
Jun 20, 2014 at 2:53 AM
Also, you don't need any credentials to contribute. Simply "Fork" Mark's version 2, make changes, push to your fork on codeplex, then send a pull request to integrate.
Developer
Jun 20, 2014 at 2:56 AM
Yes, I am phanthomas09
Developer
Jun 20, 2014 at 3:07 AM
Ah okay, I saw your comment on github. You realize that repository is the Java version, and it does not throw an error as our code does (or do you know something I don't about the Java collections classes?).
Developer
Jun 20, 2014 at 3:09 AM
True, true
I thought it was the comment on the C# solution ;-)
sorry
Developer
Jun 20, 2014 at 9:49 AM
I uploaded the code to Nigel's fork: https://git01.codeplex.com/forks/evolvedmicrobe/thomas
As most of the test data files are very large, you will only find the small ones uploaded for now.

Concerning the "MemoryException" I had earlier:
This only appears for all files that are bigger than 200MB. Can someone test e.g. TC08 (in Bio.TestAutomation > IO > VCF > VCFTestCases.cs) on his computer with more RAM?

@Nigel: Could you have a quick look over the file "VCFParser.cs"? I tried to change as little of your code as possible. But considering the MemoryException, I'm not sure if I'm calling your parser properly.
@Mark: Can you add the new Parser-Interface to the code so that I can use the changed interface?

Thanks for the help!
Developer
Jun 20, 2014 at 1:59 PM
Thomas, thanks for working on this parser. By any chance are you still available to work on this? This code has a number of issues and really isn't in an acceptable state yet for a pull request.

1 - You uploaded huge test files, and they weren't even gzipped. In general, we don't want to massively grow the test data for the library, which is already a problem; uploading an entire VCF from the 1000 Genomes project (genomes being quite big) does exactly that. We need small tests that get at the kernel of a process or issue. One of the big advantages of this parser over the Java version was also supposed to be the option to use gzip files, so we can keep our test data even smaller. The problem is that if it takes a potential contributor hours to download the project, and the project is very expensive in terms of disk space, we will lose that potential contributor.

2 - You have 7 commits labelled "- new test data" or some variation thereof. Ideally, commits should be collected units. I know you are new to git; there is a great (and free) book online called "git succinctly" - you can find it on google, and it would probably be worth reading so you know git, which is a valuable skill. It describes a few short commands to "rebase" and "squash" your commits so that they become one commit in the pull request.

3 - The unsafe code flag is not suitable for this library; we need to remove any unsafe code barring a large and necessary performance win.

4 - While I appreciate you not changing my code, it actually was meant to be changed! I thought there was plenty of room for improvement.

Any chance you will be able to build from here into a working pull request given the points above?

Cheers,
Nigel
Developer
Jun 20, 2014 at 2:15 PM
Also, looking at the diff, it appears you changed the parse command to read the entire file at once, which will cause an out-of-memory error. C# provides enumerators and the yield keyword to avoid this:

http://msdn.microsoft.com/en-us/library/9k7k7cf0.aspx

That is, you can return the contexts one at a time instead of as a giant list, for which there will essentially never be enough memory for ~1000s of genomes.
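
For instance, something along these lines (ParseDataLine is a hypothetical helper standing in for whatever builds a VariantContext from one line):

public IEnumerable<VariantContext> Parse(TextReader reader)
{
    string line;
    while ((line = reader.ReadLine()) != null)
    {
        if (line.StartsWith("#"))
            continue;                      // skip meta/header lines
        yield return ParseDataLine(line);  // one record at a time, never a giant list
    }
}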
Developer
Jun 21, 2014 at 6:25 PM
The semester finishes on Wednesday, so I'm working on my documentation at the moment (but I've already gotten pretty far with it).
Afterwards, I'll try to get as far as I can with the implementation - I'm pretty optimistic that I'll get some things done by then.

The test data is unfortunately that big because I used the data from the BroadResourceBundle. But I've started using smaller data so that I can verify the results as well.

And I'll have a look at the "yield" keyword as well - it seems pretty cool to me...
Developer
Jun 24, 2014 at 8:05 AM
The current progress of my work:
  • I updated the test cases with smaller test files and better expected test results (based on the smaller input). I also deleted the big files; there is only one large file left, to simulate an "original" file.
  • Unfortunately I had to upload the test files separately - I was not allowed to upload them all at once (because they were too large). That's why there are so many identical commit labels; my recent commits should be better.
  • I removed the "unsafe" code flag and changed the code accordingly.