orthoXML parser and formatter

Jan 18, 2012 at 7:48 PM

First, just wanted to say thanks to Rick and Mark for the training session last month.

Working with orthologs, I have come across the fairly nascent orthoXML format. It only has a parser and formatter for Java, but the format appears to have some support, particularly from the orthology databases on the web. Is a formatter/parser something that would be a meaningful contribution to this project?

 

 

Coordinator
Jan 19, 2012 at 6:14 PM

IMO there isn't any downside to including this. We want to support as many formats as possible so that .NET Bio has the broadest appeal. So by all means, nascent or not, if you have the inclination to make a contribution, we want it. Glad you enjoyed the training. Looking forward to seeing your efforts :)

Coordinator
Jan 24, 2012 at 5:37 PM

Yes, we'd love to see this as a contribution!

We would be happy to walk you through the contributions process, if you need help.

Jan 24, 2012 at 5:49 PM

I haven't had time to look through the contribution docs, but I will this week.

My first question is about generating the class file from the XML schema using the XML Schema Definition Tool (xsd.exe). Is this the best approach, or should it be more hand-tailored?

 

Coordinator
Jan 25, 2012 at 5:16 PM

Hi.

The short answer is that it's really an implementation detail - the .NET Bio parser/formatter design exposes and manipulates ISequence elements.  How you decide to internally load the file (XML) is really just a coding detail and is up to you.
The longer answer: as you noted, XSD.exe takes an XML schema file and generates a simple class definition which maps the XML element/attribute data onto public properties.  You then use the XmlSerializer type to serialize/deserialize the generated class to/from an XML file which conforms to the schema.  It's a perfectly fine approach, and you can even combine it with partial classes to add additional bits of data or behavior to the class.  The downside is that on loading, the XmlSerializer reads the entire graph into memory - i.e. it parses the whole file and creates all the objects before returning the fully instantiated object graph.
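To make the XmlSerializer route concrete, here is a minimal sketch. The classes are hand-written, hypothetical stand-ins for what xsd.exe might generate, and the element names (orthoXML, groups, orthologGroup, geneRef) are a simplified, namespace-free assumption about the format, not the full orthoXML schema.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-ins for xsd.exe output. The real schema has more
// elements and an XML namespace, both omitted here for brevity.
[XmlRoot("orthoXML")]
public class OrthoXmlDocument
{
    [XmlArray("groups")]
    [XmlArrayItem("orthologGroup")]
    public List<OrthologGroup> Groups { get; set; } = new List<OrthologGroup>();
}

public class OrthologGroup
{
    [XmlAttribute("id")]
    public string Id { get; set; }

    [XmlElement("geneRef")]
    public List<GeneRef> GeneRefs { get; set; } = new List<GeneRef>();
}

public class GeneRef
{
    [XmlAttribute("id")]
    public string Id { get; set; }
}

public static class SerializerSketch
{
    // Deserialize parses the whole input and builds every object
    // before it returns, which is the memory cost discussed above.
    public static OrthoXmlDocument Load(TextReader input)
    {
        var serializer = new XmlSerializer(typeof(OrthoXmlDocument));
        return (OrthoXmlDocument)serializer.Deserialize(input);
    }
}
```

A parser built this way would call Load once and then walk Groups to produce whatever objects it exposes, so the peak memory use grows with the size of the file.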
Personally, when dealing with XML, my go-to technology is LINQ to XML.  I can then serialize/deserialize to my own implementation (or even into someone else's implementation such as a Sequence).  In addition, when it is fed from an XmlReader rather than loading a whole XDocument, it has the advantage of defer-loading the XML data - i.e. I can read enough of the XML to generate one object and then return it for processing without loading anything else.  This is a key benefit when loading really big alignments.
Back to the .NET Bio case.  As I mentioned above, the .NET Bio parser/formatter exposes and manipulates ISequence elements, so the parsing of XML really becomes an implementation issue.  You wouldn't be exposing the POCO (or LINQ elements) directly - with the xsd approach you would generate the object graph with the XmlSerializer as an intermediate step and then walk through the resulting collection of objects to generate sequences.  As such, the xsd approach is probably going to be a bit heavier, because you end up reading the whole file into memory and then generating ISequence elements from the in-memory items.  The LINQ to XML approach would (most of the time) let you defer loading the entire file, reading only what is necessary to generate a single sequence.  How much that helps depends on the file format, because some formats require significant parsing to get at the data - i.e. an odd format where the sequence is defined up front and its data follows much later would force you to hold more in memory.
If the file format has a fixed limit on the number of elements (i.e. it always expresses one sequence) then I don't think it's an issue.  But if it's an alignment format, it might become one for larger files, in which case you might look at something like LINQ to XML (or even just a raw XmlReader) which would allow you to parse the file in stages.
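As a rough illustration of parsing in stages, the sketch below pairs a raw XmlReader with LINQ to XML so that only one element subtree is materialized at a time.  The class and method names are hypothetical, and the orthologGroup element name is an assumption about a simplified orthoXML file; this is not part of .NET Bio.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamingSketch
{
    // Lazily yields each outermost <orthologGroup> subtree. Groups nested
    // inside another group come back as part of their parent element.
    public static IEnumerable<XElement> ReadGroups(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == "orthologGroup")
                {
                    // ReadFrom builds only this element and leaves the
                    // reader on the node that follows it, so no extra
                    // Read() call is needed on this branch.
                    yield return (XElement)XNode.ReadFrom(reader);
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
}
```

A caller can then turn each group into its own object as it arrives, for example with `foreach (var g in StreamingSketch.ReadGroups(File.OpenText(path)))`, where `path` is whatever file is being parsed.  A file that declares an XML namespace would also need an XNamespace when querying child elements of each group.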
Hopefully that helps,
mark