
Patterns and matching them...

Mar 4, 2014 at 9:49 AM
Some time ago one of our people wrote a fairly elegant pattern language, specified in XML, which unifies a whole bunch of other standard pattern types: regular expressions, motifs, gaps, simple matches and so on. The language has now been ported to sit on top of .NET Bio, and we will launch both the codebase and a web-based app in the coming days. The public site is under construction.

Some blurb follows:
BioPatML.NET is an application library for the .NET Framework which integrates the BioPatML pattern definition and search engine with the .NET Bio bioinformatics library.
BioPatML is an XML-based pattern description language providing support for a broad range of component patterns, and a rich grammatical structure for their combination. The language defines a common representation for patterns which may be used to describe biologically significant sequences and sequence structures, including motifs, position weight matrices and regular expressions as well as hierarchical structures containing sequences or sets of arbitrarily complex patterns. Pattern iteration, repeat and Boolean operators permit the construction of patterns with much greater specificity than that provided by regular expression matching. The language provides an elegant mechanism for the definition and reuse of named sub-patterns, enabling the construction of pattern libraries which may be used to build concrete pattern instances.

What makes this cute is that one can specify hierarchical structures of patterns. .NET Bio grabs the files and does the actual parsing of the sequences, and we can get all the matches we want. Here is an example specification for the three components of the promoter for sigma 70, the housekeeping sigma factor in _E. coli_:
<BioPatML xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
          xsi:noNamespaceSchemaLocation="BioPatML.xsd">
    <Definition name="sigma70" >
        <Definitions>
            <Definition name="-10element" >
                <Motif motif="TATAAT" alphabet="DNA" threshold="0.7" />
            </Definition>
            <Definition name="-35element" >
                <Motif motif="TTGACA" alphabet="DNA" threshold="0.7" />
            </Definition>
            <Definition name="spacer">
                <Gap impact="0.2" minimum="15" maximum="21" threshold="0.0" />
            </Definition>
        </Definitions>
        <Series mode="BEST" threshold="0.0">
            <Use definition="-35element"/>
            <Use definition="spacer"/>
            <Use definition="-10element"/>
        </Series>
    </Definition>
</BioPatML>
So here we combine the hexamers at -35 and -10 with a specified gap of 15 to 21 bases between them. Far more sophisticated versions are possible. The port is the result of some very good work by an Indonesian student called Lalu Yazikri, and we also have a nascent web-based JS editor to allow construction of these patterns (full credit to Sadeen, our summer student from Saudi Arabia, who has done a fine job of starting that work).
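
To make the sigma 70 specification above a bit more concrete, here is a rough Python sketch of what such a pattern asks for. It is not the BioPatML.NET API and not the engine's scoring scheme. It assumes that a Motif's threshold means "fraction of identical positions", and it ignores the Gap impact and the Series mode and threshold attributes entirely. The names load_series, similarity and find_sigma70 are made up for illustration, and the schema attributes on the root element are left out for brevity.

```python
import xml.etree.ElementTree as ET

# The sigma 70 specification from the post (schema attributes omitted).
SPEC = """<BioPatML>
    <Definition name="sigma70">
        <Definitions>
            <Definition name="-10element">
                <Motif motif="TATAAT" alphabet="DNA" threshold="0.7" />
            </Definition>
            <Definition name="-35element">
                <Motif motif="TTGACA" alphabet="DNA" threshold="0.7" />
            </Definition>
            <Definition name="spacer">
                <Gap impact="0.2" minimum="15" maximum="21" threshold="0.0" />
            </Definition>
        </Definitions>
        <Series mode="BEST" threshold="0.0">
            <Use definition="-35element"/>
            <Use definition="spacer"/>
            <Use definition="-10element"/>
        </Series>
    </Definition>
</BioPatML>"""


def load_series(xml_text):
    """Resolve each <Use> in the Series to the pattern element it names."""
    top = ET.fromstring(xml_text).find("Definition")
    # Map each named sub-definition to its single child (a Motif or a Gap).
    named = {d.get("name"): d[0] for d in top.find("Definitions")}
    return [named[use.get("definition")] for use in top.find("Series")]


def similarity(motif, window):
    """ASSUMPTION: score = fraction of positions where motif and window agree."""
    return sum(a == b for a, b in zip(motif, window)) / len(motif)


def find_sigma70(seq, xml_text=SPEC):
    """Return (start of -35 element, gap length, start of -10 element) tuples."""
    minus35, spacer, minus10 = load_series(xml_text)
    m35, t35 = minus35.get("motif"), float(minus35.get("threshold"))
    m10, t10 = minus10.get("motif"), float(minus10.get("threshold"))
    gap_min, gap_max = int(spacer.get("minimum")), int(spacer.get("maximum"))

    hits = []
    for i in range(len(seq) - len(m35) + 1):
        if similarity(m35, seq[i:i + len(m35)]) < t35:
            continue
        # Try every allowed spacer length after this -35 candidate.
        for gap in range(gap_min, gap_max + 1):
            j = i + len(m35) + gap
            window = seq[j:j + len(m10)]
            if len(window) == len(m10) and similarity(m10, window) >= t10:
                hits.append((i, gap, j))
    return hits


if __name__ == "__main__":
    demo = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "CC"
    print(find_sigma70(demo))
```

With a threshold of 0.7 on a hexamer, this sketch accepts a window only if at least five of the six bases agree, so a single mismatch in either element is tolerated. The real engine may well score things differently.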

Anyway, these thoughts and images will whet the appetite, with more to come and plenty of scope for contribution. Drag and drop to create a pattern:
Image

Then run and list the results:
Image

Enjoy, and watch this space.

cheers
jh