New Sequence Aligner algorithms

Coordinator
Apr 8, 2013 at 11:25 PM
Hi All,

As you probably have seen in the forums, .NET Bio 1.1 will include a new implementation of the Smith-Waterman and Needleman-Wunsch aligners. This is because the existing code in .NET Bio didn't implement the algorithms according to the original specification and we wanted a standard implementation to compare to other platforms. The existing versions are still in the code (renamed to LegacySmithWatermanAligner and LegacyNeedlemanWunschAligner respectively) but have been replaced.

The original reason I recommended this was due to a forum post last year: http://bio.codeplex.com/discussions/357684 which indicated some problems in the aligners. I couldn't fix those issues without breaking the implementation so I deferred them until this release where we could slide in some more standard implementations.

If you go back and read that original post, one of the issues noted was that the returning alignment metadata regarding the start/end offsets is confusing (or, in some people's opinion, actually wrong, but I'll let you decide). Specifically, the pairwise alignment reports a FirstOffset and SecondOffset but it is swapped so:

SEQUENCE_1 (14): GCCAAAATTTAGGC
SEQUENCE_2 (16): TTATGGCGCCCACGGA

With the older aligners, I get:

Score: 6, FirstOffset:0, SecondOffset: 8

Alignments and markup [ |=match :=similar .=mismatch ]:
 1 TTA      3
   |||
 9 TTA     11
Notice that the second offset is actually the index into the FIRST sequence, and the first offset is the index into the SECOND sequence. Not intuitive at all. The original reason for this (and documented in the code) is that these offsets were intended to be offsets into the OTHER sequence - but unless you know that, the usage is obscure at best.

Looking into other aligner implementations, it appears that the reported offsets are for the sequences directly - in other words, FirstOffset would normally be the index into the reference (1st) sequence, and SecondOffset would be the index into the query (2nd) sequence.

So, we changed them in the new versions of the algorithms, however this is a breaking change and it means if we do this, we also have to change the PairwiseOverlapAligner to have the same behavior. This is because the assemblers use these indexes to pull strands out, so either we need to change ALL of them, or NONE of them as the OverlapAssembler doesn't work right now because it's looking at the wrong indexes for these new aligners.

Because this will impact existing code, I wanted to reach out to all of you and see what you think. Should we keep the existing behavior - FirstOffset is the index into the second sequence and vice-versa, thereby maintaining backward compatibility, or change it (and all the aligners and assemblers in the library) in order to make it adhere to a more common approach?

Thanks for any and all discussion!
mark
Coordinator
Apr 9, 2013 at 4:31 PM
I've given this a lot of thought over the past couple of days, I think the best approach will be to maintain compatibility - there is a "Startoffsets" metadata record which actually records the ending trackback cell position (col,row) that can be used to determine the first/second offset properly, and then we can just leave the properties alone - even though they are a little confusing. That way, the existing assemblers and MUMMer/NucMer tools don't need to change. It also means any existing code that uses .NET Bio now will continue to work. If anyone has a strong disagreement to this approach, then please share!

mark
Coordinator
Apr 9, 2013 at 4:39 PM
Hi Mark,

I tend to agree. I retrospect, the offsets should have been the other way around from the beginning - but given where we are now, leaving them alone seems like the least disruptive route.

Simon
Coordinator
Apr 9, 2013 at 6:23 PM
Well I sympathize with not wanting to make breaking changes. That is something you just don't want to do if you can avoid it. However in this case I do see benefits in that future users will benefit by making the implementation more intuitive. This isn't something I feel so strongly about that I would insist we change it. I realize this affects lots of our utilities and for our existing base it will be a big change. I also know this will cause some extra work and modifications of tests etc. But we do have evidence that at least one user was tripped up by this.

I think it boils down to:

Is the breaking of existing work more of pain than the possible future pain caused by this? And is the effort to fix this justified?

If we had a clean slate (no existing base) how would you choose to implement this?
We can do things to mitigate it like including it in a FAQ, commenting it in code etc.

Thanks for your thoughtful approach Mark. I am interested to hear what others have to say in this regard.

Rick