Bio.Selectome

Developer
Sep 22, 2013 at 4:01 AM
Hi all,

So I did some work with Selectome this week and wrote some interfaces to the database that might be useful to have in .NET Bio. A brief description of the library, along with some F# examples that use the R type provider, is available here:

http://evolvedmicrobe.com/blogs/?p=153

I wound up writing the Bio.Selectome namespace in C#, since that is what the rest of the library is written in (and this was my first go at F#), so only the usage examples are in F#.

I think this might be a useful thing to have in .NET Bio, so I was planning to contribute it, as the code is likely complete (I even wrote some unit tests). However, I wanted to get opinions on a few things first.
  1. Is this useful? Should it be contributed?
If 1, then...
  1. For whoever might do the code review: Selectome returns trees that carry a lot of useful metadata, which I expose as properties/fields available to the user. This tree class is basically a Bio.Phylogenetics.Tree with additional bits, but it is not derived from Tree. As mentioned in the last discussion, I think it might be useful to introduce an ITree interface rather than relying on the Tree class directly. This would make it easier for people to dump trees from Selectome to other file formats for additional computations using our existing parsers. I can contribute this either as a standalone code package or after making a separate change that introduces the tree interface, depending on whether people think this is a good idea.
  2. I stole a set of webcache code from the F# World Bank data provider. The idea is that if you run multiple queries against the database and the same query was already done in the last 30 days, you should not have to wait to download the data again. The webcache lives in the temporary internet files folder, and I think it is a win (who wants to re-download when opening the script again?), but it obviously introduces some disk space requirements and, although it is wrapped in try/catch, may hide a lurking issue I haven't seen. How do people feel about caching locally?
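
To make the caching idea concrete, here is a rough sketch of the behaviour I have in mind. It is not the actual webcache code: the class name WebCacheSketch, the method GetOrDownload, and the choice to pass the cache folder and maximum age in as parameters are all assumptions made for illustration.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of a disk-backed query cache. The names here are
// illustrative assumptions, not the types used in Bio.Selectome.
public sealed class WebCacheSketch
{
    private readonly string cacheDir;
    private readonly TimeSpan maxAge;

    public WebCacheSketch(string cacheDir, TimeSpan maxAge)
    {
        if (cacheDir == null) throw new ArgumentNullException("cacheDir");
        this.cacheDir = cacheDir;
        this.maxAge = maxAge;
    }

    // Hash the query URL so the file name is valid on any file system.
    private string PathFor(string url)
    {
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(url));
            string name = BitConverter.ToString(hash).Replace("-", "") + ".txt";
            return Path.Combine(cacheDir, name);
        }
    }

    // Return the cached response if it is younger than maxAge; otherwise
    // call the supplied download function and store what it returns.
    public string GetOrDownload(string url, Func<string, string> download)
    {
        string path = PathFor(url);
        try
        {
            if (File.Exists(path) &&
                DateTime.UtcNow - File.GetLastWriteTimeUtc(path) < maxAge)
            {
                return File.ReadAllText(path);
            }
        }
        catch (IOException) { /* unreadable cache entry: fall through and download */ }
        catch (UnauthorizedAccessException) { /* same: treat as a cache miss */ }

        string result = download(url);

        try
        {
            Directory.CreateDirectory(cacheDir);
            File.WriteAllText(path, result);
        }
        catch (IOException) { /* caching is best effort; the result is still returned */ }
        catch (UnauthorizedAccessException) { /* same */ }

        return result;
    }
}
```

With something like this, the 30-day window would be new WebCacheSketch(folder, TimeSpan.FromDays(30)), where folder could be Environment.GetFolderPath(Environment.SpecialFolder.InternetCache). A failure to read or write the cache only costs an extra download, which is the point of wrapping the disk access in try/catch.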
-Nigel
Coordinator
Sep 24, 2013 at 3:55 AM
I think the webcache functionality is a no-brainer, as long as it is possible to configure how long a cached result is returned instead of re-executing the search. 30 days might work for the World Bank, but a shorter period would seem like a good idea for faster-moving data.

Regarding the Selectome trees - I have no objection at all. Jim pointed out there is a breaking change in your suggestion - I'm also fine with that, but would suggest leaving the old code in place and marking it with a comment to indicate it is deprecated in favor of your new method. Would that work?
Developer
Sep 28, 2013 at 1:59 AM
Hi Simon,

Definitely agreed on the caching time, but Ensembl updates pretty rarely and Selectome (which is based on Ensembl) takes even longer (the PAML runs they do take a while on their cluster), so I think 30 days will be fine.

For the Selectome trees, there is no new method or anything; the tree class would just implement an interface, so the old code should stay in place as it is anyway.
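
As a rough illustration of the interface idea, here is a sketch in which a writer depends only on an interface, so a Selectome-style tree with extra metadata can be dumped without deriving from Tree. Everything in it is hypothetical: ITree, ITreeNode, SketchNode, SelectomeTreeSketch, the Metadata dictionary, and NewickSketchWriter are stand-ins, not the actual Bio.Phylogenetics or Bio.Selectome types.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical shapes for illustration only; these are assumptions,
// not the real .NET Bio tree API.
public interface ITreeNode
{
    string Name { get; }
    double BranchLength { get; }
    IReadOnlyList<ITreeNode> Children { get; }
}

public interface ITree
{
    ITreeNode Root { get; }
}

public sealed class SketchNode : ITreeNode
{
    private readonly List<ITreeNode> children;

    public SketchNode(string name, double branchLength, params ITreeNode[] kids)
    {
        Name = name;
        BranchLength = branchLength;
        children = new List<ITreeNode>(kids);
    }

    public string Name { get; }
    public double BranchLength { get; }
    public IReadOnlyList<ITreeNode> Children { get { return children; } }
}

// A Selectome-style tree: the usual topology plus extra metadata,
// implementing the interface instead of deriving from a Tree class.
public sealed class SelectomeTreeSketch : ITree
{
    public SelectomeTreeSketch(ITreeNode root, IDictionary<string, string> metadata)
    {
        Root = root;
        Metadata = metadata;
    }

    public ITreeNode Root { get; }
    public IDictionary<string, string> Metadata { get; } // hypothetical extra fields
}

// A writer that only needs ITree, so it accepts any implementation.
public static class NewickSketchWriter
{
    public static string Write(ITree tree)
    {
        return Format(tree.Root, true) + ";";
    }

    private static string Format(ITreeNode node, bool isRoot)
    {
        var sb = new StringBuilder();
        if (node.Children.Count > 0)
        {
            sb.Append('(');
            sb.Append(string.Join(",", node.Children.Select(c => Format(c, false))));
            sb.Append(')');
        }
        sb.Append(node.Name);
        if (!isRoot)
        {
            // Invariant culture keeps '.' as the decimal separator.
            sb.Append(':');
            sb.Append(node.BranchLength.ToString("R", CultureInfo.InvariantCulture));
        }
        return sb.ToString();
    }
}
```

The existing Tree class could implement the same interface without any change to its current members, which is why I don't expect the old code to go anywhere.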

I might have some time to commit the Bio.Selectome namespace this weekend. I am thinking of either adding it as a separate project (a la Bio.PamSam) or sticking it in the Bio.csproj file. If anyone has opinions, let me know...


-N