As you have probably seen in the forums, .NET Bio 1.1 will include new implementations of the Smith-Waterman and Needleman-Wunsch aligners. The existing code in .NET Bio didn't implement the algorithms according to the original specification, and we wanted a standard implementation to compare against other platforms. The existing versions are still in the code (renamed to LegacySmithWatermanAligner and LegacyNeedlemanWunschAligner respectively), but they have been replaced by the new implementations.
I originally recommended this because of a forum post last year:
which indicated some problems in the aligners. I couldn't fix those issues without breaking the implementation, so I deferred them until this release, where we could slide in some more standard implementations.
If you go back and read that original post, one of the issues noted was that the returned alignment metadata for the start/end offsets is confusing (or, in some people's opinion, actually wrong, but I'll let you decide). Specifically, the pairwise alignment reports a FirstOffset and a SecondOffset, but they are swapped. For example, given these two sequences:
SEQUENCE_1 (14): GCCAAAATTTAGGC
SEQUENCE_2 (16): TTATGGCGCCCACGGA
With the older aligners, I get:
Score: 6, FirstOffset:0, SecondOffset: 8
Alignments and markup [ |=match :=similar .=mismatch ]:
1 TTA 3
9 TTA 11
Notice that the second offset is actually the index into the FIRST sequence, and the first offset is the index into the SECOND sequence. That is not intuitive at all. The original reason for this (documented in the code) is that these offsets were intended to be offsets into the OTHER sequence - but unless you know that, the usage is obscure at best.
Looking into other aligner implementations, it appears that the reported offsets are for the sequences directly - in other words, FirstOffset would normally be the index into the reference (1st) sequence, and SecondOffset would be the index into the query (2nd) sequence.
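To make the two conventions concrete, here is a minimal sketch in Python. It is not .NET Bio code, and the function names are made up for illustration. The only values taken from the example above are the two sequences, the reported offsets (0 and 8), and the 3-base aligned region "TTA". It treats the offsets as 0-based indexes, which is what the reported values suggest.

```python
# Illustration only: hypothetical helpers, not part of .NET Bio.
SEQ1 = "GCCAAAATTTAGGC"    # first (reference) sequence, 14 bases
SEQ2 = "TTATGGCGCCCACGGA"  # second (query) sequence, 16 bases


def slice_legacy(first_offset, second_offset, length):
    """Legacy convention: each offset indexes the OTHER sequence."""
    return (SEQ1[second_offset:second_offset + length],
            SEQ2[first_offset:first_offset + length])


def slice_direct(first_offset, second_offset, length):
    """Direct convention: each offset indexes its OWN sequence."""
    return (SEQ1[first_offset:first_offset + length],
            SEQ2[second_offset:second_offset + length])


def legacy_to_direct(first_offset, second_offset):
    """Convert legacy offsets to the direct convention by swapping them."""
    return second_offset, first_offset


# Offsets as reported by the older aligner: FirstOffset 0, SecondOffset 8.
print(slice_legacy(0, 8, 3))                        # ('TTA', 'TTA')
# Reading the same numbers with the direct convention picks the wrong bases.
print(slice_direct(0, 8, 3))                        # ('GCC', 'CCC')
# Swapping the offsets first recovers the aligned region.
print(slice_direct(*legacy_to_direct(0, 8), 3))     # ('TTA', 'TTA')
```

The middle call shows what goes wrong when code that slices by these offsets assumes one convention while the aligner reports the other: it silently extracts the wrong region instead of failing.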
So we changed them in the new versions of the algorithms. However, this is a breaking change: if we do this, we also have to change the PairwiseOverlapAligner to behave the same way. The assemblers use these indexes to pull strands out, so we need to change either ALL of the aligners or NONE of them. Right now the OverlapAssembler doesn't work with the new aligners because it's looking at the wrong indexes.
Because this will impact existing code, I wanted to reach out to all of you and see what you think. Should we keep the existing behavior (FirstOffset is the index into the second sequence and vice versa), thereby maintaining backward compatibility, or change it (along with all the aligners and assemblers in the library) to follow the more common approach?
Thanks for any and all discussion!