Suggestions for the cookbook

Jun 6, 2014 at 3:36 PM
Hi there, another suggestion for the cookbook:

What is the quickest way to check a non-ambiguous DNA target sequence object for the presence of an ambiguous query sequence e.g. GGTRAG or YYYYYNYAG.

Now I know that I could do an alignment but then I would have to figure out what score corresponds to an absolute match. It is cumbersome.

Of course I could simply convert my target sequence to a string and use a string operation but that only works with non-ambiguous nucleotides.

I want something like
bool sequencepresent = DNASequence1.Contains(new Sequence(Alphabets.AmbigousDNA, "YYYYYNYAG"));

Maybe this exists and I'm simply missing it.
Developer
Jun 10, 2014 at 5:49 AM
Hi Fibula13,

Hope you are doing great! Apologies for the slow response on this, we all have day jobs (or between being willing to help friends and consulting work, several of them....), so sometimes are not as quick to respond as we would like.

In any event, just wanted to give you a ping to say I saw this, and haven't had the time to respond.

You ask an interesting question. I only really know some parts of the whole .NET Bio landscape (the ones I use for my job), so there might be a better solution than what I would propose, but to open it up to comments... for your first question, I think the library actually does implement ambiguous alignments with scoring (need to check), and one simply needs to give an ambiguous hit as many "points" as an exact hit, and it would fulfill your requirements. This could actually be done pretty simply in the SimilarityMatrix that is passed to the aligner (again, I vaguely remember, not validated).

However, the two sequences you mention are a real contrast. The first, "GGTRAG" is a degenerate representation of only two possible sequences (R can be A or G, all other characters are fixed). The second option you give, with 6 Y's (each Y=2 options) and 1 N (each N=4 options) can lead to 42^6 = 2^8 = 256 sequences. This is a couple orders of magnitude difference in terms of number of options, but in both cases a brute force enumeration would work well.

I think the real question though is the size of the sequence you are searching over. For the algorithms you are attempting, which are global and find the absolute best possible solution, they scale at order O(n
m) where n and m are the length of the query and reference sequence respectively. Now, if your reference is small, then one query, or 256 queries are irrelevant, and the globally-best-dynamic-programming alignments is a good choice.

However, if the reference is large (think vertebrate genome not bacterial), and you do this frequently, then the globally optimal solution is inefficient and you should use a hashing approach. A not-guaranteed-best-but-so-good-it-basically-is way to do this would be to query all, say 256 choices, against your reference using the .NET Bio MUMMER routine, which has a very efficient suffix array method. I did this recently when someone wanted to know where their CRISPR probes might edit the genome in an unfortunate place. So I can say, .NET Bio MUMMER implementation, fast and great! The main thing is that given a query of length N and a reference of length M, most algorithms scale with Max(N, M) or N*M, and it is important to have an idea of what N and M are ahead of time, lest brute-force-perfect eat all RAM and CPU time.

So to summarize, if your search sequence and reference is small you can
  • Create the SimilarityMatrix to score ambiguous as high as exact matches.
  • Enumerate all possible versions of the ambiguous sequence and align those.
And additionally, you can:
  • Think about the size of the sequence you are aligning against and how often you do the alignment, which will affect the optimal algorithm choice.
  • Wait for Mark to give a better solution than me, because that guy knows his stuff ;).
Cheers,
Nigel
Developer
Jun 10, 2014 at 5:50 AM
And after posting, just noticed you were only interested in exact matches (or so I think), in that case, can be quite fast.
Jun 10, 2014 at 10:03 AM
Dear Nigel

Many thanks, there is a lot of theory in your post and a lot of information.
But is it really necessary to understand all this for a simple check whether a sequence (its ambiguous so essentially a pattern) is present in another sequence or not?
There must be a simple way to do this. I can't say that I'm proficient enough to take the information above and using it construct a snippet that would do the trick.

fib
Coordinator
Jun 10, 2014 at 1:10 PM
I will post properly but I had a quick look at this over the weekend. What you are after is really a kind of string pattern search, which is available readily, but is complicated by the ambiguity. We have an extension library which helps with this as well, but obese trying to find a simpler method in the standard libs without resort to the alignments. As Nigel says, ignoring the order notation if you like, alignments are fine if the sequence is small but ugly otherwise. Mark or Lawrence?
Jun 17, 2014 at 3:49 PM
Yes a string pattern search, is that really "available readily"?
Actually a way around the problem would be a method that generates the set of all non-ambiguous DNA sequences from an ambiguous sequence.
May already exist, or I guess I could write that.

Good stuff with 2.0 by the way, time for me to get VS 2013
Coordinator
Jun 17, 2014 at 4:12 PM
I realize it's not a solution to what you are exactly trying to do, however I did add a cookbook entry for searching sequences as a result of this discussion: https://bio.codeplex.com/wikipage?title=boyer_moore_search
Jun 27, 2014 at 3:12 PM
Thanks for this. I played around with this and yes the cookbook entry is cool but it isn't quite what I needed.
When one searches a sequence pattern then it is usually the pattern which has ambiguous nucleotides and the sequence to be searched will be unambiguous.
There should be a solution that doesn't require one to convert sequences back into strings.
And the output should include (among other things) a basic contains/does not contain argument.