Performance of global alignment

Apr 9, 2013 at 1:16 PM
I have a set of approximately 47,000 sequences, and I need to perform a global alignment of all against all, with the goal of removing those with a score above some threshold X. How can I make this feasible? After hours of running, only about 50 had been processed.

Thanks for the help

Matheus Franco
Apr 9, 2013 at 11:10 PM
Hi, Matheus,

That means 47,000 x 47,000 / 2 = 1.1 billion method calls; if one call costs 0.1 sec, that is about 30,000 hours. Even if you could call the method in C code with a 10x speed-up, it would still be around 3,000 hours. Would you consider using Windows Azure for this?
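For what it's worth, the arithmetic in rough Python (the 0.1 sec per call is just an assumed figure):

```python
# Back-of-envelope estimate of the all-against-all cost.
n = 47_000                       # number of sequences
pairs = n * (n - 1) // 2         # unique pairs, roughly n * n / 2
cost_per_call = 0.1              # assumed seconds per global alignment
hours = pairs * cost_per_call / 3600
print(f"{pairs:,} alignments -> about {hours:,.0f} hours at 0.1 s each")
# ~1.1 billion alignments -> roughly 30,000 hours
```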

Best,

Dong
Developer
Apr 9, 2013 at 11:19 PM
I can't think of any good reason to do that many alignments and then only keep the pairs above X. Global alignments are VERY expensive. You need to pre-filter so you are only aligning against candidates that are likely to score above X. This is very easy to do; see the HSP discussion of the BLAST algorithm, for instance.

http://en.wikipedia.org/wiki/BLAST

Basically, hash your sequences' k-mers and only attempt a global alignment against somewhat similar sequences. Also, make sure you are only doing N choose 2 alignments, not N x N.
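For illustration only, here is a minimal Python sketch of that k-mer pre-filter idea; the `k` and `min_shared` values are arbitrary placeholders, and it ignores memory concerns for very common k-mers:

```python
from collections import defaultdict
from itertools import combinations

def kmers(seq, k=8):
    """All k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def candidate_pairs(sequences, k=8, min_shared=3):
    """Return index pairs sharing at least `min_shared` k-mers.

    Only these candidates need a full global alignment; pairs sharing
    few or no k-mers are assumed to score well below the threshold.
    """
    index = defaultdict(set)          # k-mer -> indices of sequences containing it
    for i, seq in enumerate(sequences):
        for km in kmers(seq, k):
            index[km].add(i)

    shared = defaultdict(int)         # (i, j) -> number of shared k-mers
    for ids in index.values():
        for pair in combinations(sorted(ids), 2):
            shared[pair] += 1

    return [pair for pair, count in shared.items() if count >= min_shared]
```

Only the returned pairs would then be passed to the expensive global aligner, and you decide which member of each high-scoring pair to drop.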
Apr 10, 2013 at 2:13 AM
Hi Dong,
   Thanks for the suggestion.
Best,
Apr 10, 2013 at 2:17 AM
Hi evolvedmicrobe,

I understand your proposal. The problem is that I need to remove sequences within the same FASTA file that are similar to one another, so that only unique sequences remain for generating oligos for a microarray experiment. What approach would you recommend?

best regards,
Apr 10, 2013 at 4:45 PM
So you are trying to pick as few sequences as possible to build the oligo libraries. If no other clever way is suggested, here is a quick thought: reduce the search space at runtime, removing each sequence from the pool as soon as it is found to be similar to one you keep. If you can do this with multiple threads, the space should shrink very fast, i.e. all similar ones will be removed from it.
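If it helps, a minimal sketch of that greedy search-space reduction, where `is_similar` is just a stand-in for your own test (k-mer pre-filter plus global alignment score above X):

```python
def greedy_unique(sequences, is_similar):
    """Keep a sequence only if it is not similar to anything already kept.

    `is_similar(a, b)` is a placeholder for the caller's own test,
    e.g. global alignment score above X.  Every discarded sequence
    leaves the pool immediately, so the search space keeps shrinking.
    """
    kept = []
    for seq in sequences:
        if not any(is_similar(seq, other) for other in kept):
            kept.append(seq)
    return kept
```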

Best,

Dong