Objects to Abstract Protein Structures

Sep 16, 2013 at 5:46 PM
Edited Sep 16, 2013 at 6:02 PM
The Open Bioinformatics Foundation uses this basic (slightly generalised) structure to express protein objects:
type Atom(element : string, coords : float[] ) =
    member val Element = element with get
    member val Coords = coords with get, set 

type Group(residuetype : string, atoms : Atom[]) = // used residue to distinguish from Amino Acid
    member val ResidueType = residuetype with get
    member val Atoms = atoms with get, set

type Chain(chainid : string, residues : Group[]) = // Group is the residue type defined above
    member val ChainId = chainid with get
    member val Residues = residues with get

type Structure(pdbcode : string, chains : Chain[]) =
    member val PDBCode = pdbcode with get
    member val Chains = chains with get, set
Their objects have a variety of extra getters and setters for PDB-specific items such as atom occupancy. I suppose these would be necessary for a "complete" abstraction of a PDB file, but the essential items of a PDB file are the header, title, resolution, atoms and their locations, chain ID (with organised atoms), and a representation of the positions of the chains.

The other features, such as connections, can be implemented after the basic framework has been completed, and these shouldn't affect the other objects.

Is there a specific format in which people suggest the structures and members of objects here?
Coordinator
Sep 16, 2013 at 7:29 PM
I would put the most common things that everyone expects into properties and then add a Metadata dictionary, similar to what we have in ISequence. Also, I suggest we abstract this into interfaces with concrete implementations so people can always build their own abstractions if they need to (for virtualization, additional custom data, etc.). Does that seem agreeable to everyone?
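
A minimal sketch of what that suggestion could look like. The names `IAtomRecord` and `AtomRecord` and the "Occupancy" key are assumptions made up for illustration and are not part of .NET Bio: commonly expected values are properties, and PDB-specific extras go into the Metadata dictionary.

```csharp
using System.Collections.Generic;

// Hypothetical interface: only the members everyone expects are properties.
public interface IAtomRecord
{
    string ElementSymbol { get; }
    float[] Coords { get; }

    // Less common, format-specific values (e.g. PDB occupancy) live here.
    IDictionary<string, object> Metadata { get; }
}

// One possible concrete implementation; callers are free to supply their own.
public class AtomRecord : IAtomRecord
{
    public AtomRecord(string elementSymbol, float[] coords)
    {
        ElementSymbol = elementSymbol;
        Coords = coords;
        Metadata = new Dictionary<string, object>();
    }

    public string ElementSymbol { get; private set; }
    public float[] Coords { get; private set; }
    public IDictionary<string, object> Metadata { get; private set; }
}
```

Code that only needs coordinates can work against `IAtomRecord`, while a parser can store a value such as occupancy with `atom.Metadata["Occupancy"] = 1.0f` without widening the interface.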

mark
Sep 16, 2013 at 9:05 PM
Edited Sep 17, 2013 at 4:09 PM
A Metadata dictionary seems most logical to me. The way metadata has previously been handled in BioJava makes the API a little more difficult to navigate.

I have made the basic interfaces, but they should be reviewed. I do not know how to push changes to this project on CodePlex. I can post the code here, but I'm not sure that would be particularly helpful. They are, however, fairly small.
Coordinator
Sep 21, 2013 at 2:37 AM
Agree with Mark's comments. Why don't you expose some basic interfaces for comment, even on the forums to begin with, and we can have a full discussion of the design before we move forward. I will have some others look at this as well.

cheers
jh
Sep 21, 2013 at 7:07 AM
Edited Sep 21, 2013 at 3:31 PM
For lack of a better place to post these, I will simply post the code directly in this window. The interfaces are basically the same as the Open Bioinformatics Foundation's scheme, simplified a bit and made stricter for Element. Little issues like case need to be sorted out. Also, I'm unsure how appropriate it is to use enums to express things like types of residues or nucleotides, but these can easily be changed to a plain string if that's better in C#.

The base for the structures, Element, is strictly defined, and is thus far ported from the Open Bioinformatics Foundation's rendition:

/// <summary>
/// The Element defines the properties of every element on the periodic table.
/// This class is largely ported from the Element in BioJava.
/// </summary>
class Element
{

    /// <summary>
    /// An enumeration to define the different possible types of elements.  "Unknown" can also be used as a "wildcard" element for a broader set of elements, such as the "R" element.
    /// </summary>
    public enum ElementType { METALLOID, OTHER_NONMETAL, HALOGEN, NOBLE_GAS, ALKALI_METAL, ALKALINE_EARTH_METAL, LANTHANOID, ACTINOID, TRANSITION_METAL, POST_TRANSITION_METAL, UNKNOWN };

    public int atomicNumber;
    public int period;
    public int hillOrder;
    public float VDWRadius;
    public float covalentRadius;
    public int valenceElectronCount;
    public int minimumValence;
    public int maximumValence;
    public int commonValence;
    public int maximumCovalentValence;
    public float atomicMass;
    public int coreElectronCount;
    public int oxidationState;
    public float paulingElectronegativity;
    public ElementType elementType;

    public Element(int atomicNumber,
        int period,
        int hillOrder,
        float VDWRadius,
        float covalentRadius,
        int valenceElectronCount,
        int minimumValence,
        int maximumValence,
        int commonValence,
        int maximumCovalentValence,
        float atomicMass,
        int coreElectronCount,
        int oxidationState,
        float paulingElectronegativity,
        ElementType elementType)
    {
        this.atomicNumber = atomicNumber;
        this.period = period;
        this.hillOrder = hillOrder;
        this.VDWRadius = VDWRadius;
        this.covalentRadius = covalentRadius;
        this.valenceElectronCount = valenceElectronCount;
        this.minimumValence = minimumValence;
        this.maximumValence = maximumValence;
        this.commonValence = commonValence;
        this.maximumCovalentValence = maximumCovalentValence;
        this.atomicMass = atomicMass;
        this.coreElectronCount = coreElectronCount;
        this.oxidationState = oxidationState;
        this.paulingElectronegativity = paulingElectronegativity;
        this.elementType = elementType;
    }

    public static readonly Element H  = new Element(1, 1, 39, 1.10f, 0.32f, 1, 1, 1, 1, 1, 1.008f, 0, 1, 2.20f, ElementType.OTHER_NONMETAL);
    public static readonly Element C  = new Element(6, 2, 0, 1.55f, 0.77f, 4, 4, 4, 4, 4, 12.011f, 2, -4, 2.55f, ElementType.OTHER_NONMETAL);
    public static readonly Element N  = new Element(7, 2, 57, 1.40f, 0.75f, 5, 2, 5, 3, 4, 14.007f, 2, -3, 3.04f, ElementType.OTHER_NONMETAL);
    public static readonly Element O  = new Element(8, 2, 65, 1.35f, 0.73f, 6, 1, 2, 2, 2, 16.000f, 2, -2, 3.44f, ElementType.OTHER_NONMETAL);
    ....
    ....
}

The Atom builds on the Element. The two are kept distinct so that atoms can be created and manipulated individually, while the Element class merely holds static information for every element.
/// <summary>
/// Implementation of Atom, making up one of the core sets of data structures for the Structure library in Bio.
/// </summary>
interface Atom
{
    /// <summary>
    /// Gets the element for this atom such as H, Li, Be.
    /// </summary>
    Element Type { get; }

    /// <summary>
    /// Gets the coordinates for this atom, such as [1.242f,3.234f,1.342f]
    /// </summary>
    float[] Coords { get; }

    /// <summary>
    /// Gets the crystallographic B Factor or temperature factor of the atom.
    /// This value is commonly used for the atom, vital for analysis, and is thus available explicitly, outside of the metadata.
    /// </summary>
    float BFactor { get; }

    /// <summary>
    /// Gets the optional parent Group to which the atom belongs, for easy access.
    /// </summary>
    Group ParentGroup { get; }

    /// <summary>
    /// Gets the dictionary of metadata for this Atom.
    /// There are currently no standard metadata for this object.
    /// </summary>
    Dictionary<string, object> Metadata { get; }
}
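
To make the Atom/Element split concrete, here is a small sketch under assumptions: `ElementInfo` is a cut-down, hypothetical stand-in for the Element class and `SimpleAtom` is a hypothetical concrete atom, neither of which is the actual proposed code. It shows many atoms pointing at one shared, immutable element instance while each keeps its own coordinates.

```csharp
// Hypothetical, cut-down stand-in for the Element class.
public sealed class ElementInfo
{
    public readonly int AtomicNumber;
    public readonly float AtomicMass;

    private ElementInfo(int atomicNumber, float atomicMass)
    {
        AtomicNumber = atomicNumber;
        AtomicMass = atomicMass;
    }

    // One shared, immutable instance per element.
    public static readonly ElementInfo C = new ElementInfo(6, 12.011f);
    public static readonly ElementInfo N = new ElementInfo(7, 14.007f);
}

// Hypothetical concrete atom: per-atom state plus a reference to shared element data.
public class SimpleAtom
{
    public SimpleAtom(ElementInfo type, float x, float y, float z)
    {
        Type = type;
        Coords = new[] { x, y, z };
    }

    public ElementInfo Type { get; private set; }
    public float[] Coords { get; private set; }

    // Moves this atom only; the shared ElementInfo is untouched.
    public void Translate(float dx, float dy, float dz)
    {
        Coords[0] += dx;
        Coords[1] += dy;
        Coords[2] += dz;
    }
}
```

Two carbon atoms built this way share the same `ElementInfo.C` reference, so the periodic-table data is stored once no matter how many atoms a structure contains.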
The residue class extends the Group interface, keeping the naming exactly the same as the Open Bioinformatics Foundation's.
/// <summary>
/// Implementation of Group, making up one of the core sets of data structures for the Structure library in Bio.
/// </summary>
interface Group
{
    /// <summary>
    /// The most common group types.  Other should be used if the group of atoms does not belong to an Amino Acid.
    /// For a residue class this is: { GLY, ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO, SER, THR, CYS, TYR, ASN, GLN, ASP, GLU, LYS, ARG, HIS, OTHER }
    /// For a nucleotide class this is: { A, C, G, T, U, W, S, M, K, R, Y, B, D, H, V, N }
    /// </summary>
    enum GroupType { };

    /// <summary>
    /// Gets the type of Group such as ARG, ALA, etc.
    /// </summary>
    GroupType Type { get; }

    /// <summary>
    /// Gets the list of atoms for this Group.
    /// </summary>
    List<Atom> Atoms { get; }

    /// <summary>
    /// Gets the chain number for this group.
    /// </summary>
    int ChainNumber { get; }

    /// <summary>
    /// Gets the parent chain for this group.
    /// </summary>
    Chain ParentChain { get; }

    /// <summary>
    /// Gets the dictionary of metadata for this Group.
    /// There are currently no standard metadata for this object.
    /// </summary>
    Dictionary<string, object> Metadata { get; }
}
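
On the enum-versus-string question and the case issue, one possible approach is to keep the enum and map incoming PDB residue names onto it case-insensitively. This is only a sketch: `ResidueType` and `ResidueTypes.Parse` are hypothetical names, and treating every unrecognised name as OTHER is an assumption.

```csharp
using System;
using System.Collections.Generic;

// The residue values listed in the GroupType comment, written out as an enum.
public enum ResidueType
{
    GLY, ALA, VAL, LEU, ILE, MET, PHE, TRP, PRO, SER,
    THR, CYS, TYR, ASN, GLN, ASP, GLU, LYS, ARG, HIS, OTHER
}

// Hypothetical helper that maps PDB residue names onto the enum.
public static class ResidueTypes
{
    // Case-insensitive lookup built once from the enum's own names.
    private static readonly Dictionary<string, ResidueType> ByName = BuildLookup();

    private static Dictionary<string, ResidueType> BuildLookup()
    {
        var lookup = new Dictionary<string, ResidueType>(StringComparer.OrdinalIgnoreCase);
        foreach (ResidueType value in Enum.GetValues(typeof(ResidueType)))
        {
            lookup[value.ToString()] = value;
        }
        return lookup;
    }

    // Returns the matching residue type, or OTHER for anything unrecognised.
    public static ResidueType Parse(string name)
    {
        ResidueType result;
        if (name != null && ByName.TryGetValue(name.Trim(), out result))
        {
            return result;
        }
        return ResidueType.OTHER;
    }
}
```

A dictionary is used here rather than `Enum.TryParse` because `Enum.TryParse` also accepts numeric strings, which would silently turn an input like "3" into a residue.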
And then Chain:
/// <summary>
/// Implementation of Chain, making up one of the core sets
/// of data structures for the Structure library in Bio.
/// </summary>
interface Chain
{
    /// <summary>
    /// Gets the ID of the chain.
    /// This is usually a letter, such as 'A.'
    /// A PDB sometimes has several chains such as PDB Code 1BUY chain 'A' and chain 'B'
    /// </summary>
    string ID { get; }

    /// <summary>
    /// Gets the list of groups for this chain.
    /// </summary>
    List<Group> Groups { get; }

    /// <summary>
    /// Converts the PDB chain to the Bio ISequence
    /// </summary>
    /// <returns>ISequence of the amino acids in this chain.</returns>
    ISequence ToISequence();

    /// <summary>
    /// Retrieves only the standard amino acids for this chain.
    /// </summary>
    /// <returns>List of residue groups.</returns>
    List<Group> GetStandardAminoAcids();

    /// <summary>
    /// Retrieves only the nucleotide groups for this chain.
    /// </summary>
    /// <returns>List of nucleotide groups.</returns>
    List<Group> GetNucleotides();

    /// <summary>
    /// Gets the dictionary of metadata for this Chain.
    /// There are currently no standard metadata for this object.
    /// </summary>
    Dictionary<string, object> Metadata { get; }
}
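
A rough sketch of the filtering and conversion idea behind GetStandardAminoAcids and ToISequence, under stated assumptions: it works on plain residue-name strings rather than Group objects, it produces a one-letter string rather than a Bio ISequence, and it assumes "standard" means the twenty common amino acids. `ChainSequence` and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; not part of .NET Bio.
public static class ChainSequence
{
    // Standard three-letter residue names and their one-letter codes.
    private static readonly Dictionary<string, char> OneLetter =
        new Dictionary<string, char>(StringComparer.OrdinalIgnoreCase)
        {
            { "GLY", 'G' }, { "ALA", 'A' }, { "VAL", 'V' }, { "LEU", 'L' },
            { "ILE", 'I' }, { "MET", 'M' }, { "PHE", 'F' }, { "TRP", 'W' },
            { "PRO", 'P' }, { "SER", 'S' }, { "THR", 'T' }, { "CYS", 'C' },
            { "TYR", 'Y' }, { "ASN", 'N' }, { "GLN", 'Q' }, { "ASP", 'D' },
            { "GLU", 'E' }, { "LYS", 'K' }, { "ARG", 'R' }, { "HIS", 'H' }
        };

    // True only for the twenty standard amino acids.
    public static bool IsStandardAminoAcid(string residueName)
    {
        return residueName != null && OneLetter.ContainsKey(residueName);
    }

    // Builds a one-letter string, skipping waters, ligands and other non-standard groups.
    public static string ToOneLetterCodes(IEnumerable<string> residueNames)
    {
        var builder = new StringBuilder();
        foreach (string name in residueNames)
        {
            char code;
            if (name != null && OneLetter.TryGetValue(name, out code))
            {
                builder.Append(code);
            }
        }
        return builder.ToString();
    }
}
```

For example, `ChainSequence.ToOneLetterCodes(new[] { "MET", "ALA", "HOH", "GLY" })` drops the water and yields the three residue codes in order.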
And, last, Structure:
/// <summary>
/// Implementation of Structure, making up one of the core sets
/// of data structures for the Structure library in Bio.
/// </summary>
interface Structure
{
    /// <summary>
    /// Gets or sets the PDB Code for this protein structure such as 1BUY or 3IZ8.
    /// </summary>
    string PDBCode { get; set; }

    /// <summary>
    /// Gets or sets the chains for this structure, matched to their chain letter.
    /// </summary>
    Dictionary<string, Chain> Chains { get; set; }

    /// <summary>
    /// The resolution of this structure.
    /// </summary>
    float Resolution { get; set; }

    /// <summary>
    /// Gets or sets the dictionary of metadata for this Structure.
    /// Standard metadata for this object include:
    ///   Title
    ///   Description
    /// </summary>
    Dictionary<string, object> Metadata { get; set; }

    /// <summary>
    /// A method to convert the current protein into a PDB representation.
    /// </summary>
    /// <returns>A string of the PDB representation</returns>
    string ToPDB();
}
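
As a sketch of what ToPDB would have to emit per atom, the helper below formats a single fixed-width, 80-column ATOM record. `PdbWriter.FormatAtomRecord` is a hypothetical name, and the serial number, atom name and occupancy are passed in as plain parameters because the Atom interface above does not expose them. Alternate location, insertion code and charge are left blank for simplicity.

```csharp
using System.Globalization;

// Hypothetical helper; not part of .NET Bio.
public static class PdbWriter
{
    // Formats one 80-column ATOM record.
    // Simplifications: no alternate location, insertion code or charge.
    public static string FormatAtomRecord(
        int serial, string atomName, string residueName, char chainId, int residueNumber,
        float x, float y, float z, float occupancy, float bFactor, string element)
    {
        // Names shorter than four characters start in column 14 by convention
        // (valid for one-letter element symbols, which is all this sketch handles).
        string paddedName = atomName.Length < 4 ? " " + atomName : atomName;

        // Columns: 1-6 record, 7-11 serial, 13-16 name, 18-20 residue, 22 chain,
        // 23-26 residue number, 31-54 x/y/z, 55-60 occupancy, 61-66 B factor,
        // 77-78 element (right-aligned in a 12-wide field starting at column 67).
        string line = string.Format(
            CultureInfo.InvariantCulture,
            "ATOM  {0,5} {1,-4} {2,3} {3}{4,4}    {5,8:F3}{6,8:F3}{7,8:F3}{8,6:F2}{9,6:F2}{10,12}",
            serial, paddedName, residueName, chainId, residueNumber,
            x, y, z, occupancy, bFactor, element);

        // Columns 79-80 (charge) are left blank.
        return line.PadRight(80);
    }
}
```

The invariant culture is used so that coordinates are always written with a decimal point, whatever the locale of the machine producing the file.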
Coordinator
Sep 26, 2013 at 11:31 PM
Seems a good start. The ToISequence method is better called ToSequence while retaining the ISequence return type. Can you clarify what comes back in the list of residue and nucleotide groups?
Coordinator
Oct 2, 2013 at 4:10 PM
Hi Calem,

I have several suggestions for the design which would bring it more in line with the .NET'ish style. Can you send me your source files which include the above? I could edit them and send them back for you to compare. My email is mark at julmar dot com.

mark