What Next?

Coordinator
Mar 4, 2014 at 8:57 AM
I've been having some chats with evolvedmicrobe and have thus been thinking about some next steps. The work from Amber Weightman late last year is an obvious focus for us and potentially for others, but I have a small group of people talking to me at present about projects. This isn't an army of contributors, but they seem pretty sharp and focused and so it is a good time to pin down some of the next priorities.

In no particular order, these foci may include:
  • DDRadSeq and related work (the Amber project mentioned above)
  • parsers, parsers and more parsers
  • pattern search tools
  • additional web services
  • better support for phylogenetics
All thoughts welcome at this stage.

cheers
jh
Developer
Mar 4, 2014 at 3:51 PM
Hi all,
Am hoping to post something longer here in the future, but in the meantime, wanted to quickly post some other things Jim and I discussed while I have a moment.

Some ideas:
  • VCF Parser and “layered over” quality tools. I am discussing this with Mark. The idea would be to have a toolset capable of reading VCF files and outputting them in several formats after passing them through various QC filters.
  • BGZF/Tabix information – Large genomics data files are often compressed in the BGZF format and indexed with Tabix (http://bioinformatics.oxfordjournals.org/content/27/5/718). It would be nice to implement this generic interface for querying data from flat files (including VCF files as mentioned above).
  • Faster decompression/compression for parsers/data formats. The data is getting large and decompressing it can take a very long time, though storing it uncompressed is a total waste. Google has an algorithm, SNAPPY (http://code.google.com/p/snappy/), that is supposedly an order of magnitude faster than zlib, and it would be great to be able to store/decompress with this framework as well. It might be nice to adopt the BGZF idea of “blocked” zipped file sections that can be randomly accessed, but using snappy instead of zlib to do the actual zipping. This could make the BAM/Zipped-VCF parsing much faster.
  • Ploidulator. It would be nice to push this more so it was a more complete toolset that was available for download by others, to examine quality metrics etc. It is nearly there, but would be nice to expand on.
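To make the first idea above concrete, here is a rough sketch of "read VCF, apply QC filters, write another format". It is written in Python purely for brevity and is not .NET Bio code; every name in it (`VcfRecord`, `read_vcf`, `min_qual`, `passed`, `to_bed`) is hypothetical, and it only looks at the fixed VCF columns, ignoring INFO and genotype fields.

```python
import io
from typing import Callable, Iterable, Iterator, List, NamedTuple


class VcfRecord(NamedTuple):
    chrom: str
    pos: int      # 1-based position, as stored in the VCF file
    ref: str
    alt: str
    qual: float   # NaN when the file has "." (missing)
    filt: str     # contents of the FILTER column


def read_vcf(lines: Iterable[str]) -> Iterator[VcfRecord]:
    """Yield one record per data line, skipping blank and '#' header lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Fixed columns: CHROM POS ID REF ALT QUAL FILTER INFO
        qual = float("nan") if f[5] == "." else float(f[5])
        yield VcfRecord(f[0], int(f[1]), f[3], f[4], qual, f[6])


def min_qual(threshold: float) -> Callable[[VcfRecord], bool]:
    """QC filter: keep records at or above a quality threshold."""
    return lambda r: r.qual >= threshold  # NaN compares False, so it is dropped


def passed(r: VcfRecord) -> bool:
    """QC filter: keep records that passed (or were never filtered)."""
    return r.filt in ("PASS", ".")


def apply_filters(records, filters) -> Iterator[VcfRecord]:
    """Lazily keep only the records accepted by every filter."""
    return (r for r in records if all(f(r) for f in filters))


def to_bed(r: VcfRecord) -> str:
    """One possible output format: a 0-based, half-open BED interval."""
    start = r.pos - 1
    return f"{r.chrom}\t{start}\t{start + len(r.ref)}"


SAMPLE = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t10\tPASS\t.\n"
    "chr2\t300\t.\tGA\tG\t99\tq10\t.\n"
)


def run_demo() -> List[str]:
    records = read_vcf(io.StringIO(SAMPLE))
    kept = apply_filters(records, [min_qual(30.0), passed])
    return [to_bed(r) for r in kept]
```

Because the reader and the filters are generators, nothing is held in memory beyond the current record, which matters for the file sizes mentioned above. Additional output formats would just be further functions alongside `to_bed`.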
Cheers,
N
Coordinator
Mar 5, 2014 at 11:01 AM
The other thing that we have been discussing today is the possibility of including some greater support for phylogenetic inference. We have tree parsers and formatters and various incarnations which implement the interfaces. As a result of algorithms such as MUSCLE we have some good supporting infrastructure through Kimura distances, similarity matrices etc. Some of these can be usefully adapted or used directly if we want to undertake this sort of work. Any thoughts on the utility? We have direct use for some of this stuff, and so it makes sense to do some of it, but I'm still interested in comments. Thoughts?
Coordinator
Mar 6, 2014 at 5:09 AM
Nigel and/or Mark - Lawrence and I are talking with Thomas, a student from Germany here for the semester. We are looking at whether to throw him at ploidulator, but we thought that the best starting point in getting him familiar with the library might be to throw him at a parser project. An initial stab at the VCF parser? I don't mind if it becomes the whole project as long as he can do something coherent and good in a semester - we would like to see this ready for mid year and possibly a next release? Thoughts?
Coordinator
Mar 6, 2014 at 7:04 PM
This sounds interesting - my two cents based on the discussion so far:

Adding basic functionality to the core libraries provides the most potential bang for the buck, in that the features added here can be easily leveraged in all apps built using .NET Bio. The downside is that adding a parser (for example) isn't 'sexy' and the value is not immediately evident to others. Since we would like to attract more community members, there is considerable value in doing things that get attention as well.

People should add whatever it is they are motivated to work on - either because they have specific expertise or specific needs. Ultimately it is better to have more people making contributions than it is that they fit with any overarching strategy - so long as they broadly enhance .NET Bio (i.e., parsers of biology-related formats good, news reader bad).

Beyond that, if there is interest in basic enhancements I would suggest that the web service framework is long overdue for an overhaul; it could be more efficient, and it should be simpler to extend the set of supported web services. Every additional web service connector greatly increases the breadth of data and services accessible through .NET Bio.

If there is interest in an application using .NET Bio, my advice would be that it should meet the needs of a group of bench scientists you know. The more broadly useful and 'sexier' the app the better of course, but experience has shown that it is not intuitive to design an app that genuinely helps scientists do something - what is needed is a virtuous cycle of gathering use cases, prototyping and user feedback. Of course a further advantage of this approach is that you have a ready-made user community once you have finished, in the form of the lab of scientists who helped you meet their needs - and of course a ready audience in other labs with the same need.

Overall, I think the best next steps would be:
  • Extend the basic functionality to make .NET Bio more comprehensive and useful - this would resolve into numerous small-scale activities that could be suitable for student projects
  • Consider a more ambitious refactoring of the webservice framework (or revisit the subdivision of the library into .dlls, or similar). This work would be beneficial, but is unlikely to increase usage on its own
  • Develop an application that benefits scientists you know - this requires the most commitment and the most work, but would benefit the developer directly because they would have a user community, and would serve to demonstrate the value of building on top of .NET Bio.
Simon
Coordinator
Mar 6, 2014 at 7:22 PM
I think I've said it before, I'm all for revising the web service framework; there are new features in .NET 4 and 4.5 which would make that area so much easier to work with and consume. However, I recognize that the task is more of a framework issue and isn't likely to make .NET Bio a household name ;-)

mark
Developer
Mar 7, 2014 at 5:56 AM
Will have more on this later, but wanted to respond quickly before heading to bed.

VCF Parser - Jim, this sounds great, if the student was interested I think it could be a fine starting point, some preliminary code is here:
https://github.com/evolvedmicrobe/Bio.VCF , which they should feel free to completely drop, and the format is specified here: http://samtools.github.io/hts-specs/VCFv4.2.pdf

For phylogenetics, that is a fragmented and rather complex landscape... I might stay away from it for the moment given the difficulty of making anything sustainable. If one wanted to work with C# and build for the future though, I think wrapping the BEAGLE library could be useful (https://code.google.com/p/beagle-lib/). The fundamental problem is likelihood calculations on trees take forever, so being computationally quick is the sine qua non for a useful library addition. Right now the BEAST/Java platform is the best for this, but they don't have P/Invoke, so I think we could win there.

Simon, I love your ideas and think involving scientists and applied problems is a great way to grow the library. I am in discussions about this now and will hopefully have more soon.

Cheers,
N
Coordinator
Mar 12, 2014 at 1:35 AM
Ok, I think we are likely to run with Thomas working on a mix of the parser and the web services framework. Mark, do you have a feel for the effort involved in the web services framework revisions? We have also found some locals who have done a fair bit of radSeq in some large plant genomes and are interested in ddradseq, so this will certainly be an interesting time to push on ploidulator.

We also have one guy - Craig Maher, who appeared briefly here a year or so back - working on our genomics IR azure project. I have just been talking with him and I am going to get him to expose this as a service for .NET Bio on Azure as this develops, so this is something that will need to slot in with the work from Thomas and anyone else who wants to contribute to the web services side. And finally for the moment, we have a student team who are looking at local copies of genomic resources and again exposing these as a service. The idea here is that sometimes it is nice to have a targeted local copy of genomic resources and to offer more convenient access to some metadata and features than the vanilla NCBI services. This local copy discussion is one for another time, but again the web services framework is pretty relevant, and there may be some overlap.

cheers
jh
Coordinator
Mar 12, 2014 at 3:03 PM
Jim, that sounds great - the best possible solution is to have some apps that showcase the .NET Bio libraries and deliver real value to researchers, plus some enhancements to the libraries themselves to make it easier to build more (and more powerful) apps. It sounds like there are enough contributors to do a bit of both.

Since we have regular phonecalls we can keep track of progress, and I'll also sync up with Mark. Hopefully we can make some significant progress, then do another release.

Simon
Coordinator
Mar 13, 2014 at 3:46 PM
Hi Jim,

Yes, I actually even have a basic architecture sketched out; I think it could be done in 2-3 weeks of time for the platform architecture and a BLAST implementation to show it off. At that point we could just start adding other web service handlers for other popular online services. I'll see if I can write up the architecture. One point on this - I'd like to take advantage of the new .NET 4.5 features and make the services completely async - so no synchronous version at all. I just don't see the point in blocking services in 2014 :-)

mark
Developer
Mar 13, 2014 at 5:47 PM
Sounds like progress is being made, excellent.

Jim, has Craig started working with the parser at all yet? I spent some time cleaning up and improving the code this weekend, and am actually starting to think it is about 90% there for production usage (I now have it as fast or faster than the Java version too, while using significantly less memory, which is nice). It really just needs more code comments at this point and some unit tests. I have been talking to some people, and actually think a lot of very useful tools could follow from this. For example, a common task is to call variants in next gen sequencing using slightly different methods. One would then like to know how these two approaches compare. Given the parser and two VCF files, this would be a super simple tool to implement, but very useful for many people. Might be a great warm up for Craig, and I think would offer some nice opportunities to integrate with ploidulator down the road.
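As a rough illustration of how small the comparison tool could be, here is a sketch in Python (for brevity only; it does not use the Bio.VCF code). The function names are hypothetical, and it assumes two calls are "the same" when chromosome, position, reference and alternate allele match exactly. A real tool would probably also need to normalize indel representations before comparing.

```python
from typing import Dict, Iterable, Set, Tuple

VariantKey = Tuple[str, int, str, str]  # (chrom, pos, ref, alt allele)


def variant_keys(lines: Iterable[str]) -> Set[VariantKey]:
    """Collect one key per alternate allele from the data lines of a VCF."""
    keys: Set[VariantKey] = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and headers
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Multi-allelic sites list their ALT alleles comma-separated.
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref, allele))
    return keys


def compare_calls(a_lines: Iterable[str],
                  b_lines: Iterable[str]) -> Dict[str, Set[VariantKey]]:
    """Split the calls into those shared and those unique to each method."""
    a, b = variant_keys(a_lines), variant_keys(b_lines)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Two tiny made-up call sets standing in for the output of two callers.
CALLS_A = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tC\tT,G\t40\tPASS\t.",
]
CALLS_B = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t60\tPASS\t.",
    "chr1\t200\t.\tC\tT\t35\tPASS\t.",
    "chr2\t5\t.\tT\tC\t80\tPASS\t.",
]

result = compare_calls(CALLS_A, CALLS_B)
```

With real files the two arguments would simply be open file handles, and the sizes of the three sets give a quick concordance summary between the two calling methods.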

I also think it would be nice to get some documentation for using .NET Bio on Mac/Linux up on the website. Right now it works great with Mono/Xamarin studio if people just grab the nuget package, but I am not sure we mention that anywhere, so we are missing out on a lot of possible developers. On that note, biopython has a great “cookbook” (http://biopython.org/DIST/docs/tutorial/Tutorial.html) that is super useful for those trying to learn the code. I think it might be nice to do something similar for .NET Bio. We already have some AWESOME documentation as word documents, but was thinking of setting up a place where people could go to get quick code snippets (how to load a fasta file, how to align two sequences, etc.). I am not sure if we could set up a wiki page or something, but if so we could drop code into it through time.

Some other thoughts on the next release. Agreed it would be nice to have the web service handlers be asynchronous. My sense, though, is that most people at present wouldn’t really do anything with the thread freed up by not blocking, as I don’t know of many GUI components that use the library (multiple threads are just now becoming common in bioinformatics programming). Would be nice to clean up the interface though.

I am also not totally comfortable with our Nucmer/Mummer implementation. I think it is basically functional, but it might be nice to fix the issues in the issue tracker before the next release.

From my end, I am considering implementing a SNAPPY compression backend for data formats. There is some pure C# code for this here: https://github.com/jeffesp/Snappy.Sharp that supposedly works and that the author has agreed to share with us. I am going to run some benchmarks later, and if it is nice, might put it in the library. I say there is only a 10% chance of this actually happening, but if anyone would object to it, let me know.
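For anyone wondering what a swappable backend combined with the BGZF-style "blocked" layout might look like, here is a minimal Python sketch. It is only an illustration under stated assumptions: `BlockedStore` and its parameters are hypothetical names, the blocks live in memory rather than in a file, and zlib stands in for Snappy because it ships with Python. Any pair of compress/decompress functions could be passed in instead.

```python
import zlib
from typing import Callable, List

Codec = Callable[[bytes], bytes]


class BlockedStore:
    """Compress data in independent fixed-size blocks for random access."""

    def __init__(self, data: bytes,
                 compress: Codec = zlib.compress,
                 decompress: Codec = zlib.decompress,
                 block_size: int = 64 * 1024) -> None:
        self._decompress = decompress
        self._block_size = block_size      # uncompressed bytes per block
        self._length = len(data)
        # Each block is compressed on its own, so it can be decoded alone.
        self._blocks: List[bytes] = [
            compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)
        ]

    def read(self, offset: int, size: int) -> bytes:
        """Return up to `size` uncompressed bytes starting at `offset`."""
        if offset < 0 or size < 0:
            raise ValueError("offset and size must be non-negative")
        end = min(offset + size, self._length)
        if offset >= end:
            return b""
        first = offset // self._block_size
        last = (end - 1) // self._block_size
        # Only the blocks that overlap the requested range are decompressed.
        chunk = b"".join(self._decompress(b)
                         for b in self._blocks[first:last + 1])
        start = offset - first * self._block_size
        return chunk[start:start + (end - offset)]


# A quarter of a megabyte of repetitive test data, split into 4 KiB blocks.
data = bytes(range(256)) * 1000
store = BlockedStore(data, block_size=4096)
middle = store.read(5000, 10000)
```

The trade-off is the usual one: smaller blocks make random reads cheaper but compress less well, since each block starts with no history.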

Cheers,
Nigel
Developer
Mar 13, 2014 at 7:32 PM
Just did some quick snappy benchmarks here:

http://evolvedmicrobe.com/blogs/?p=253

In case anyone is interested, it seems to smoke on compression but not do so well on decompression (though this might be a problem with the benchmark).
Coordinator
Mar 13, 2014 at 8:29 PM
Interesting! Actually, I had missed your previous post about the speed of the BAM parser - thanks!
Coordinator
Mar 16, 2014 at 4:34 AM
Great collection of stuff Nigel. Thomas is the guy who was going to look at the parser and WS stuff, so I think that given your post - and mark's comments on the architecture for WS - we should move. Can you link to the current version of the code? I think that this makes a lot of sense as a starting point for Thomas.

The compression stuff is interesting as well. Unsure of how that fits into the mix for now. I'm meeting with various people tomorrow, so we'll take it from there. Issue is defining and then scoping anything built on top of the parser. Idea was to use the parser as a familiarisation exercise before jumping to the WS work. How well-defined are your ideas of additional tools?

cheers
jh
Developer
Mar 16, 2014 at 6:41 PM
Think the compression stuff was a wash... probably not worth pursuing for now.

The latest version of the parser is up on github. An example of the basic types of things we should be able to do with it and build on is here: http://vcftools.sourceforge.net/ (a Perl library that we should at least match). Will be excited to add this in!
Coordinator
Mar 18, 2014 at 5:15 AM
Ok, thanks - have flicked this to Thomas, who will sign up shortly. We will start him immediately on the VCF parser, and look at scoping that and seeing whether the total project fits in the time he has available. I think so, but will confirm.
Developer
Mar 19, 2014 at 1:10 AM
Excellent, and feel free to let Thomas know he should feel comfortable contacting me with any questions, etc.
Coordinator
Mar 24, 2014 at 4:47 AM
Hi mark
do you have that architectural description and any related commentary or rationale? Thomas is working on the parser and we are trying to plan out the web services tasks as well.

cheers
jh
Coordinator
Mar 25, 2014 at 2:40 PM
Yes, I just need time to formalize it. But here's the idea:
  1. We need to redo the interface design. It's too complex and not flexible enough for future web services. I'd like to see a base interface which is fairly broad (and it would be similar to what we have) but then have some more specific generic interfaces that can be implemented for specific data types. I'd like to support any style of service (SOAP or REST or ...). I'm happy to take this task on and have talked a little to Nigel about it.
  2. As part of (1), I'd like to re-implement the BLAST service, it provides a proof of concept and practical implementation we already have unit tests for.
  3. We then need other people to identify and create additional implementations for other services. Here is where I think Thomas can contribute greatly. With an existing model in place from 1 & 2, if we can pick 2 or 3 other great data services out there then this becomes an immediate value add to .NET Bio.
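Until the architecture is written up properly, the three steps above could be pictured roughly as follows. This is a Python sketch for brevity, not the planned C# design; all class and method names are hypothetical, and `FakeBlastService` makes no network call at all. It only shows the shape: a broad base interface, a generic typed interface layered on top, and an async-only `execute`.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

TRequest = TypeVar("TRequest")
TResult = TypeVar("TResult")


class WebService(ABC):
    """Broad base interface: what every service exposes, whatever the transport."""

    @property
    @abstractmethod
    def name(self) -> str:
        ...


class QueryService(WebService, Generic[TRequest, TResult]):
    """More specific generic interface, typed by request and result."""

    @abstractmethod
    async def execute(self, request: TRequest) -> TResult:
        """Async only: there is deliberately no blocking variant."""


class FakeBlastService(QueryService[str, List[str]]):
    """Stand-in for a BLAST handler; returns canned hits instead of calling out."""

    @property
    def name(self) -> str:
        return "fake-blast"

    async def execute(self, request: str) -> List[str]:
        await asyncio.sleep(0)  # placeholder for the network round trip
        return [f"hit-for-{request}"]


async def run_all(service: QueryService, requests: List[str]) -> list:
    # Queries are issued concurrently; results come back in request order.
    return await asyncio.gather(*(service.execute(r) for r in requests))


def demo() -> list:
    return asyncio.run(run_all(FakeBlastService(), ["ACGT", "TTGA"]))
```

Adding another service under this shape means writing one more subclass with its own request and result types, which is the part where additional contributors could plug in.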
mark
Coordinator
Mar 27, 2014 at 12:16 AM
Nigel - sorry for the late post...been busy with the paid work :)

Anyway in regards to -
I also think it would be nice to get some documentation for using .NET bio on Mac/Linux up on the website. Right now it works great with Mono/Xamarin studio if people just grab the nuget package, but I am not sure we mention that anywhere, so we are missing out on a lot of possible developers. On that note, biopython has a great “cookbook” (http://biopython.org/DIST/docs/tutorial/Tutorial.html) that is super useful for those trying to learn the code. I think it might be nice to do something similar for .NET Bio. We already have some AWESOME documentation as word documents, but was thinking of setting up a place where people could go to get quick code snippets (how to load a fasta file, how to align two sequences, etc.). I am not sure if we could setup a wiki page or something, but if so we could drop code into it through time.

I think the FAQ section of the site or the documentation section would work for this.
If you follow the similar style of the FAQ I think I can get that up on the site for you - I think I still remember the wiki editing tools enough to do so.

Something like:
"How do I load a FASTA file?"
A......

Rick