Protein Data Bank Files

Sep 15, 2013 at 2:53 AM
There seem to be no parsers for, or any work done towards, structural biology formats such as Protein Data Bank (PDB) files.

Are there plans for this? Are they already implemented and I simply haven't seen them?


If there aren't plans and they are missing, then I may begin to implement these.
Coordinator
Sep 15, 2013 at 7:35 AM
Hi Calem - welcome. As far as I am aware - and I've just done a quick exploration just to refresh my memory - there is nothing like this existing in the libraries or indeed present in the discussions to date. Most of the work so far has been devoted almost solely to the traditional sequence-focused view of the world, with the obvious richness and limitations inherent in those datasets. I think that structural biology formats would be a useful addition but we should have a chat about where they might fit in the hierarchy. If you have a specific application in mind, it might help to post a para or some bullet points telling us a bit about it.

That said, if F# is one of your favourite languages, then there are abundant opportunities to contribute on that side as well. Have a look at the post from a day or so back and see if any of that takes your fancy.

Whatever, welcome to the community and we'd be very pleased to help you get started and contribute.

Cheers
jh
Sep 15, 2013 at 7:37 PM
Hey James, thanks for your quick response! Note before you read that I'm completely new to CodePlex, I've joined it specifically for this project, and I may be slightly clueless concerning CodePlex style and .NET Bio because of this. I've not previously used the .NET Bio library, but I have now reviewed some of its source and its organisation.

Looking at where they may fit in the current hierarchy, I've worked previously with the BioRuby, BioJava, and BioPython libraries, all of which come from the same group, and these solutions generally make a new package for all structural stuff called... structure (go figure). The best organised and most complete of these libraries, BioJava, has a structure package in org.biojava.bio. You can view this package here: http://www.biojava.org/docs/api/org/biojava/bio/structure/package-summary.html
The organisation of .NET Bio appears to be somewhat similar to BioJava.

A Protein Data Bank file parser could be implemented under Bio.IO.Pdb; the structural aspects of the actual PDB object could be under Bio.Structure. There could also be an implementation of Chain, Group, and Atom that would resemble the BioJava implementation to make it easier to switch.
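
As a rough illustration of that split (a parser on one side, Chain/Group/Atom objects on the other), here is a minimal sketch. It is in Python only to keep it short; a .NET Bio version would presumably be C#. All class and function names here are hypothetical and are not taken from BioJava or .NET Bio. The column positions follow the fixed-width ATOM/HETATM record layout of the PDB format. The sketch assumes a single model and ignores alternate locations and insertion codes.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class Atom:
    serial: int
    name: str
    x: float
    y: float
    z: float
    element: str


@dataclass
class Group:
    """A residue or ligand: the atoms sharing one name and sequence number."""
    name: str
    seq_number: int
    atoms: List[Atom] = field(default_factory=list)


@dataclass
class Chain:
    chain_id: str
    groups: List[Group] = field(default_factory=list)


def parse_atom_line(line: str) -> Tuple[str, str, int, Atom]:
    """Slice one ATOM/HETATM record by its fixed columns (0-based slices)."""
    line = line.ljust(80)  # tolerate lines with trailing columns trimmed
    atom = Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        element=line[76:78].strip(),   # columns 77-78
    )
    # chain ID (column 22), residue name (18-20), residue number (23-26)
    return line[21], line[17:20].strip(), int(line[22:26]), atom


def parse_chains(lines: Iterable[str]) -> Dict[str, Chain]:
    """Build chain -> group -> atom objects from ATOM/HETATM records."""
    chains: Dict[str, Chain] = {}
    for line in lines:
        if not line.startswith(("ATOM  ", "HETATM")):
            continue  # every other record type is skipped in this sketch
        chain_id, res_name, res_seq, atom = parse_atom_line(line)
        chain = chains.setdefault(chain_id, Chain(chain_id))
        last = chain.groups[-1] if chain.groups else None
        # start a new group whenever the residue name or number changes
        if last is None or (last.name, last.seq_number) != (res_name, res_seq):
            last = Group(res_name, res_seq)
            chain.groups.append(last)
        last.atoms.append(atom)
    return chains


# Made-up sample records, laid out in the fixed PDB columns.
sample = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38           C",
    "ATOM      9  N   ILE A   2      26.850  29.021   3.898  1.00  9.76           N",
]
chains = parse_chains(sample)
```

The point of keeping `parse_atom_line` separate from the object model is the same as putting the parser and the structure classes in different namespaces: the structure classes know nothing about column offsets, so another reader could later fill the same objects.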

While I do not propose that a port be made of BioJava, as I don't believe the way it structures some of its interfaces and other members is appropriate, this example may be a good one to follow. It also lets me work on this privately and then integrate the code as a new package with its own unit tests.
I must admit that I'll need to get a little used to working on a larger project in .NET, which I've previously done only on the JVM, and I don't think it's a good idea that I touch any of the core code until I'm more used to C#'s style in projects (which so far seems to differ little from plain Java).

For looking at F# type providers, I would be happy to help implement these, but I would not be able to work on this until approximately June. Sadly, I'm a student, and I'm not nearly as available as I'd like to be this semester; the reason I would be implementing a structure package would be that I need to make one anyway for a research project involving structural analysis. I've previously used BioRuby and BioJava, but I'll be working on a cluster and using a proprietary package built in the lab in Visual C++. I'd be happy to contribute some hours to F# type providers and other technical details, but this semester I will be working first and foremost on getting the results I need for publication :-/ (I may also end up using existing, slightly buggy C++ code if time gets tight, but I'll try to contribute what I can here for structural biology).

Also, were a structure package to be implemented, would it be preferable for this package to be in C# or would F# be equally appropriate? Or I could implement it in F# temporarily (for the sake of speed on my part) and re-implement it in C# later.
Coordinator
Sep 16, 2013 at 2:52 PM
I think this would be great - I actually have some PDB parsing code I could contribute as a starting point; the problem is we don't have any structural objects (as you noted). I think, generally speaking, C# is the preferred language - just because it is the most common language used in .NET programming (everything else is written in C#), and also because it doesn't have any runtime requirements (additional assemblies you must deploy). F#, while awesome, is less widely used and understood. I'd be happy to help you in any way you'd need to get started or in translating from F# to C#. I think the first step would be designing the data structures which the parsers would produce. Perhaps start a new discussion topic?

mark
Sep 16, 2013 at 5:01 PM
Is your PDB parsing code in C#? It'd be very helpful to include it.

I've created a new discussion topic for the structure of the objects.
Coordinator
Sep 16, 2013 at 6:26 PM
Yes, it's in C# -- I'll have to track it down; it was a few years ago that I needed it. But I'll move to the new discussion topic as well.
Coordinator
Sep 21, 2013 at 1:27 AM
Sorry for the intervening time. Crazy week.

So, my reading of the post above is that you don't yet have any F# code to contribute, but more that F# is an interest to come later, so C# is the good starting point. The IO namespace is obviously right for the parser, but we then need to move on to the structures as indicated. I have a team of UG students who might be able to do some of this. It really depends on how much you want to lead and drive it, and how much time you can devote to it while still passing your degree :)

Let's continue the discussion. If mark can track down the code as a starting point, then this could work well.
Oct 29, 2013 at 1:01 PM
For what it is worth... there are a number of options for handling PDB files and structure data in .NET environments (see http://www.biostars.org/p/10592/ for a selection).

More importantly, due to the limitations of the PDB file format (http://www.wwpdb.org/docs.html) it is being phased out by the wwPDB in favour of the more flexible PDBx/mmCIF (http://mmcif.pdb.org/) and PDBML (http://pdbml.rcsb.org/) formats. Since the PDB data is available in these formats from all of the wwPDB members (RCSB, PDBe and PDBj) and the newer formats provide improved granularity as well as supporting larger structures, it would likely make sense to base support on these in the first instance.
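
To make the difference concrete: the PDB ATOM record stores the atom serial number in a 5-column field, so it cannot go past 99999, whereas PDBx/mmCIF stores atoms as rows of an `_atom_site` table with named, whitespace-separated columns. Below is a minimal sketch of reading that table. The function name `parse_atom_site` is hypothetical, the sample data is made up, and the sketch assumes no quoted or multi-line values, which a real mmCIF reader would have to handle.

```python
from typing import Dict, Iterable, List


def parse_atom_site(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one dict per row of the first `_atom_site` loop."""
    columns: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("_atom_site.") and not rows:
            # header line: remember the column name after the category
            columns.append(line.split(".", 1)[1])
        elif columns:
            if not line or line.startswith(("#", "_", "loop_", "data_")):
                break  # the atom_site table has ended
            values = line.split()  # assumption: no quoted values
            if len(values) != len(columns):
                raise ValueError("row does not match header: " + line)
            rows.append(dict(zip(columns, values)))
    return rows


# Made-up sample; the second atom id would not fit a 5-column PDB serial.
sample = """\
data_demo
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM 1      N N  MET A 1 27.340 24.430 2.614
ATOM 100000 C CA MET A 1 26.266 25.413 2.842
#
"""
rows = parse_atom_site(sample.splitlines())
```

Because each value is looked up by column name rather than by character offset, the reader does not depend on field widths or column order.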