Too cumbersome?

May 27, 2014 at 2:16 PM
Edited May 27, 2014 at 2:21 PM
I have been playing around with .NET Bio for a while now.
At some point I noticed that it took me much longer to get even very simple things done than it does in other environments. I think the implementation of sequences as Interfaces and byte arrays may be advantageous for performance reasons but complicates everything.

Eventually I found myself just working with strings anyway, ignoring and working around .NET Bio by writing my own little classes. It is kind of sad, but not everyone is proficient enough to work at the same level. Libraries like this one have to be useful at every level to succeed. Better documentation may help to a certain extent. Try, for example, to find out how to quickly merge two sequences when you are just starting out with this...
May 27, 2014 at 2:30 PM
This is good feedback - ISequence was chosen so we could carry other information around with the data (i.e. metadata, alphabet, etc.), but it would be advantageous to support implicit conversions to and from strings and byte arrays I think. It wouldn't be performant, but it would make it easier to work with. It would allow this:
SequenceAligners.NeedlemanWunsch.Align("AAA....", "AGTC....");
I'll look into that - thanks for the idea!
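To make the idea concrete, here is a rough sketch of what implicit string conversion could look like. SimpleSequence and SimpleSequenceDemo are made-up names for illustration, not .NET Bio types, and the sketch assumes plain ASCII symbols with no alphabet validation or metadata:

```csharp
using System;
using System.Text;

// Hypothetical wrapper (NOT a .NET Bio type): stores symbols as bytes,
// but converts implicitly to and from string.
public sealed class SimpleSequence
{
    private readonly byte[] symbols;

    public SimpleSequence(byte[] symbols)
    {
        if (symbols == null) throw new ArgumentNullException(nameof(symbols));
        this.symbols = (byte[])symbols.Clone();
    }

    public int Count => symbols.Length;

    // string -> SimpleSequence, so a method taking a sequence also accepts "ACGT".
    public static implicit operator SimpleSequence(string text) =>
        new SimpleSequence(Encoding.ASCII.GetBytes(text));

    // SimpleSequence -> string, for callers who prefer to work with text.
    public static implicit operator string(SimpleSequence seq) =>
        Encoding.ASCII.GetString(seq.symbols);

    public override string ToString() => Encoding.ASCII.GetString(symbols);
}

public static class SimpleSequenceDemo
{
    // Stand-in for a method that takes sequences; here it just merges the two.
    public static SimpleSequence Concat(SimpleSequence a, SimpleSequence b) =>
        (string)a + (string)b;
}
```

With something like this, SimpleSequenceDemo.Concat("AAA", "CCC") compiles without any explicit conversion. The cost is an extra copy on every conversion, which is the performance trade-off mentioned above.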

Regarding the quickstarts - we are starting a "cookbook" of recipes which should aid with getting started.

May 27, 2014 at 10:08 PM
I like this comment too. The hierarchy is awkward but valuable when you do stuff that is more sophisticated. That said, I found it easier than BioJava, for example. Overall I think the issue is making the boilerplate available in a copy-and-paste way, and then also perhaps a simplifying wrapper. Not sure about the latter.
May 28, 2014 at 9:01 AM
Edited May 28, 2014 at 10:15 AM
Thanks for your replies. I'm really a biologist and not a programmer, but I guess I'm the perfect lab rat to see whether .NET Bio works for people of lower programming proficiency. Yes, a cookbook would help, especially if the jump to .NET Bio 2.0 makes some of the older information outdated. I find the PowerPoint presentations quite useful because they provide context, but whenever they get to the point of actually providing useful snippets they jump to the next topic. These could be expanded. I would also recommend not spreading your documentation over too many places.

Today's suggestion from me for the cookbook: how to simply align two sequences and get the similarity score out.

var algorithm = Bio.Algorithms.Alignment.SequenceAligners.NeedlemanWunsch;
algorithm.SimilarityMatrix = new SimilarityMatrix(SimilarityMatrix.StandardSimilarityMatrix.AmbiguousDna);

var results = algorithm.AlignSimple(DNAseqitem1, DNAseqitem2);
string similarityscore = results[0].Metadata["Score"].ToString();

This doesn't work. What am I doing wrong? :)
May 28, 2014 at 4:06 PM
Hey Fibula,

Yes, byte arrays versus strings is a constant issue (though the byte implementation is more or less needed for both performance and memory). I like Mark's idea of just providing wrappers around strings to avoid conversions and make things simpler.

I would actually stay away from AlignSimple: it doesn't use an affine gap penalty, which leads to subpar alignments. Plain Align should work better. Below is a complete example going from string to alignment. You almost had it, but one more nested access was needed.

You mentioned you are new to programming. Out of curiosity, do you know how to use the debugger to inspect local variables? This helps a lot in tracking down this type of information.
        using System;
        using Bio;
        using Bio.Algorithms.Alignment;
        using Bio.SimilarityMatrices;

        //make some sequences to align
        //NOTE: longRef was not defined in the snippet as posted; the value below is a
        //made-up placeholder so the example compiles. Substitute your own reference.
        var longRef = "TGACCCCGAGGGGGGCCGGGGCCGCGTCCCTGGGCCCTCCCCA";
        var longHapOrg = "TGACCCCGAGGG---CCGGG--------------CCCTCCCCA";
        var longHap = longHapOrg.Replace("-", ""); //remove gaps to get the unaligned sequence
        //create the sequences
        var lr = new Sequence(DnaAlphabet.Instance, longRef);
        var lh = new Sequence(DnaAlphabet.Instance, longHap);
        //create an aligner
        var al = new NeedlemanWunschAligner();
        //Create a scoring matrix. These are the parameters from the program
        //BWA MEM and will tend to favor long exact matches, which performs
        //well for many situations.
        al.SimilarityMatrix = new DiagonalSimilarityMatrix(1, -4);
        al.GapOpenCost = -6;
        al.GapExtensionCost = -1;
        //now align
        var res = al.Align(lr, lh);
        //the score sits one level down: result -> PairwiseAlignedSequences -> Metadata
        Console.WriteLine("Score is: " + res[0].PairwiseAlignedSequences[0].Metadata["Score"].ToString());
        Console.WriteLine(res[0].PairwiseAlignedSequences[0].ToString());

May 28, 2014 at 5:38 PM

I think this code example would be a great one to see if we could simplify. Of course, one way is to use the NuGet package sample code - it has a BioHelpers.Align method which does what your code does ;-)

May 29, 2014 at 1:55 AM
I wrote similar code to Nigel's last night while exploring this, and I have modified it a little to use his alignment suggestions. The code is below, where I intentionally explored the lists a little. The results object is a list of IPairwiseSequenceAlignment. Each ipsa then has a member list of PairwiseAlignedSequences, with the metadata associated with each of these. In the end that is rational enough; it is just an additional level for people to get their heads around. Helper code is probably about simplifying the setup and the access to the sequences and metadata at the end. Not sure the second of these is easy as long as you have a list of alignments - it always seems to require some index-based access. The ToString() on ipsa sensibly gives the sequences.
            Bio.Sequence dna1 = new Sequence(Alphabets.AmbiguousDNA, "ACTGAAGGATATTA");
            Bio.Sequence dna2 = new Sequence(Alphabets.AmbiguousDNA, "ACTGTCCTAGATATTA");
            var algo = new Bio.Algorithms.Alignment.NeedlemanWunschAligner();
            algo.SimilarityMatrix = new Bio.SimilarityMatrices.SimilarityMatrix(SimilarityMatrix.StandardSimilarityMatrix.AmbiguousDna);

            algo.GapOpenCost = -6;
            algo.GapExtensionCost = -1;

            var results = algo.Align(dna1, dna2);
            Console.WriteLine("Pairwise Alignment: " + results.Count + " result entries");
            foreach (IPairwiseSequenceAlignment ipsa in results) {
                //Note equiv: ipsa.PairwiseAlignedSequences[0].ToString()
                Console.WriteLine("Processing Pairwise Alignments: " + ipsa.PairwiseAlignedSequences.Count + " entries");
                foreach (PairwiseAlignedSequence pas in ipsa.PairwiseAlignedSequences) {
                    //key spelled "Score", matching the other snippets in this thread
                    Console.Write("Alignment Score:" + pas.Metadata["Score"].ToString());
                }
            }
Results of this code below:
Pairwise Alignment: 1 result entries

Processing Pairwise Alignments: 1 entries
Alignment Score:44
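
On the index-based access point: one option that avoids explicit indexes is to flatten the two levels with LINQ. The sketch below uses made-up stand-in classes (StubAlignment, StubAlignedPair) that only mimic the shape described above, a list of alignments each holding a list of aligned pairs with a Metadata dictionary. They are not the .NET Bio types, and I am assuming the score is stored under the "Score" key:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for one aligned pair (hypothetical, not the .NET Bio class).
public class StubAlignedPair
{
    public Dictionary<string, object> Metadata { get; } = new Dictionary<string, object>();
}

// Stand-in for one alignment result holding a list of aligned pairs.
public class StubAlignment
{
    public List<StubAlignedPair> PairwiseAlignedSequences { get; } = new List<StubAlignedPair>();
}

public static class ScoreHelper
{
    // Flattens results -> aligned pairs -> score, with no [0] indexing.
    public static IEnumerable<long> AllScores(IEnumerable<StubAlignment> results) =>
        results
            .SelectMany(r => r.PairwiseAlignedSequences)
            .Select(p => Convert.ToInt64(p.Metadata["Score"]));
}
```

A caller who only wants the best score could then write ScoreHelper.AllScores(results).Max(), with no nested loops.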
May 29, 2014 at 1:56 AM
So, fibula13, we have sort of hijacked the thread a little, but what other stuff are you working with?
May 29, 2014 at 9:41 AM
Thanks for the info, snippets like those are very useful. I think I initiated the hijacking of the thread myself.

As I said, I'm your biologist turned low-proficiency bioinformatician. I use all sorts of software, from Vector NTI Designer and Geneious with its Java plugins up to BioPerl and co., all in a very duct-tape-wallet sort of style to get me what I need.

I'm more interested in synthetic biology, and it is clear that .NET Bio is more geared towards generating classes useful for genomics and the handling of NGS data (a conscious choice, or a consequence of the mix of people driving it?).

I prefer the C# and VS environment, and .NET Bio seems to be the only good library for molecular biology I could find. It also looks like it is alive and evolving.
Jun 16, 2014 at 10:30 PM
I added this into the Cookbook.
Jun 16, 2014 at 10:32 PM
Looks GREAT!!