.NET Bio 2.0

Coordinator
Apr 21, 2014 at 12:26 AM
Jim, Simon, Nigel and I have had a couple of private emails about .NET Bio 2.0. I'll let Jim append to this discussion about the scope of the work since he's monitoring it, but I wanted to let everyone know that I will be creating a new branch of the code to host the next release since it will have quite a bit of refactoring in it.

The current timeline is to release it towards the end of June, so any changes between now and then should be moved to this new branch once it's created. If it's a bug fix, then let's plan on pushing into the 1.x branch as well for people who want to stay on the stable release trunk but still get any bug fixes.

Feel free to throw any questions or comments onto this thread as well.

Looking forward to working with everyone towards another great release!

Mark
Developer
Apr 21, 2014 at 1:44 AM
Edited Apr 21, 2014 at 1:45 AM
Mark,

Excellent, this sounds great! Just a quick pie-in-the-sky idea related to this: a lot of bio folks out there are working with GitHub, and (probably due to my own expertise with git but not TFS) I personally would find keeping changes in sync between my branch and the server branch much easier if we were using the git system on CodePlex.

I think (though I could be wrong) that a transition to git right now would be too difficult, but does anyone know if the git integration into the Visual Studio/CodePlex ecosystem is stable enough yet, or if there are things we could do in this release that might position us favorably for a git transition in the future?

Cheers,
N
Coordinator
Apr 21, 2014 at 1:33 PM
I'll check into it. Last time I looked, you could choose Git when you created the project, but there was no way to change it after the fact. We could create a separate repository for development and then push it back into TFS as a workaround - but that means pulling it from here and putting it somewhere else while it's being developed...

mark
Coordinator
Apr 21, 2014 at 1:34 PM
It looks like codeplex will allow us to convert it - but it's a manual process and it affects existing source code. Do we want to do that?

https://codeplex.codeplex.com/wikipage?title=CodePlex%20FAQ#TFStoMercurial

mark
Developer
Apr 21, 2014 at 2:48 PM
Well, it sounds like this would be feasible! I for one would be in favor, not sure if we would have any cause for concern related to this. How do others feel?
Coordinator
Apr 21, 2014 at 5:17 PM
I'm not a user of the repository and so I don't have a strong opinion either way - my concern is to ensure that the existing tools in use on the project (Visual Studio specifically) will continue to work without issue.

Does anyone feel able to evaluate the impact of this move? Will unit tests still work? Will it still be as easy to build installers? Will there still be at least one free (or close to free) IDE that works with .NET Bio? What functionality will we lose, and what gains can we expect?

This is a small community, and while that limits the extent to which breaking changes are an issue, it also limits the size of the pool of volunteer resources we have to draw on - and so we need to consider moves like this carefully to make sure we aren't biting off more than we can chew.

Simon
Coordinator
Apr 21, 2014 at 5:28 PM
I'm comfortable making the request.

Git actually provides a nicer way to manage multiple user contributions because of the way it forces you to interact with the source code. You "pull" (clone) the source code locally, make changes and commit them to your local repository, and then "push" the changes back, where they can be verified and merged into the main trunk. It is very similar to shelvesets in TFS, but kind of enforced. The nice thing is you get local source control, so you can make a bunch of changes, keep them isolated from the global branch, and then push all of them at once with comments and get someone else to verify them and merge them in - keeping comments and such. That's a lot harder to do with shelves and TFS.

The only issue I think we might have is that work items aren't pulled into Visual Studio - i.e. when you push a change, you can't associate the change with a specific work item since that's TFS and the built-in VS support for Git doesn't expose work items. This isn't a show stopper to me - we can just close the work items manually in the portal once a change is merged in.

Pretty sure all the VS 2013 SKUs have support for Git already so no issue there. Also, I think it's actually a benefit to the community outside Microsoft since Git is far more prevalent there, so from a usability perspective I think we'll find more people use Git than TFS anyway.

My $0.02... anyone else have an opinion?

mark
Developer
Apr 21, 2014 at 5:33 PM
I agree with Mark, and think this would be a good idea. Not only would managing branches be easier (I am struggling to do this with TFS and basically have two complete installs on my computer), but we would also make it a lot easier for people to contribute, since we would probably be using a version control system that they know and that is also integrated with Xamarin.

For unit testing I still think people would need Visual Studio, but that is the case now anyway so no loss there.

Also, it looks like one can integrate with the issue tracker a bit (see: https://gitscc.codeplex.com/discussions/400997).

Nigel
Developer
Apr 22, 2014 at 5:23 AM
Edited Apr 22, 2014 at 5:34 AM
As a demo run, I just took a CodePlex project controlled by git and tried to go through a new contributor's workflow as if they were using the project. Steps were cloning code -> branching to make new feature -> submit pull request -> merge. I also tried some other variations.

In short, I totally think the new Git integration in visual studio is awesome. I played with it for about 10 minutes and it was far and away my preferred way of navigating git history now. I had tried it when it first came out... some bugs... but really it was seamless now. It took me a few days to get started with this project with TFS, I think git will be vastly better for new people.

Codeplex is not fantastic at supporting everything possible, but unlike github, when you want to do something on the web but can't, it lists all the git commands to do it in the shell right there, which saves a ton of time if anyone is not used to git.

I will be the first to admit I always thought TFS had more features than I was using, but I think git could be great for the project.

From my evaluation, the benefits really outweigh the costs. As Mark mentioned, the only real downside is not having issue tracking directly resolved. However, presumably the issue should actually be closed by whoever reviews and accepts the pull request, plus a commit can be associated with an issue even if it doesn't resolve it, so it's not such a big deal there.

Perhaps someone from QUT or Rickbe still wants to weigh in, but Mark, provided no one speaks now and forever holds their peace, should we go ahead?
Coordinator
Apr 22, 2014 at 5:35 AM
I think it is safe to say that we will be fine with a transition to Git, especially given what appears to be a good level of support in Visual Studio.

Cheers, Lawrence.
Coordinator
Apr 22, 2014 at 3:39 PM
Ok, then I'll get the ball rolling on that!
Apr 23, 2014 at 7:01 AM
Although it has been a while since I have done much with .NET Bio, my $0.02 is that git is getting a lot of traction and making the transition here is a reasonable plan.

I will admit that on the projects where we are using git today, I still find it a bit 'different' after my 20+ years of a 'more traditional' source control system. The ease with which a multitude of branches are being created and passed about gives good isolation for an individual, but it can be 'confusing' when people try to bring all the branches together without losing things. And having to switch back and forth between git and TFS on different projects is probably adding its own bit to the confusion. :-)
-bobd-
Apr 23, 2014 at 8:17 AM
Also, if you are new to git and using it from Visual Studio, this link might help:
http://www.microsoftvirtualacademy.com/training-courses/using-git-with-visual-studio-2013-jump-start

-bobd-
Coordinator
Apr 27, 2014 at 8:55 PM
Edited Apr 27, 2014 at 8:56 PM
Our repository is now set up for GIT. Let's GIT'r done.. ugh.. bad joke :-)
Developer
Apr 28, 2014 at 3:32 PM
woo hoo!
Coordinator
Apr 30, 2014 at 6:00 AM
ouch. Sorry for the lack of responsiveness - been travelling. back home soon, so will post then. The bar is low for gags, but I will try to get under it.
Developer
May 5, 2014 at 11:10 PM
Edited May 5, 2014 at 11:10 PM
Hi All,

So I am trying out the git version. So far, this is great; it is a lot easier to add files for testing data.

One thing I have now noticed: many of my files failed to work for testing because our test data normally lives at
bio\bio\Tests\TestData\TestUtils
but is duplicated in other places, e.g.
bio\bio\Tests\Bio.TestAutomation\TestUtils
Oftentimes the directory structure is there, but no data. Does anyone know what these folders are there for? It's confusing to me to have all these empty folders, but I suspect there may be a reason. If not, can we remove them?

Also, I am going to commit a .gitignore file. Visual Studio seems smart enough to not need this, but I think it will help when I am on Linux. If anyone wants to validate my changes, it would be appreciated.

Cheers,
N
Coordinator
May 5, 2014 at 11:47 PM
Go ahead and check in a .gitignore, I have one as well - but I'll merge any VS-specific and Xamarin-specific stuff into yours.

The empty duplicate directories are old - the test data was all merged into one folder in 1.1. We used to repeat it everywhere but I tried hard to push it all to one place and then link it in the other projects.

mark
Developer
May 5, 2014 at 11:55 PM
Hey Mark,

Sounds good, I just committed my .gitignore file, feel free to modify!

And yeah, the test data looks a lot cleaner, I can make sense of it now! For some reason the directories are still there but empty, and are not showing up on git, they are pretty harmless but if you can figure out how to remove the empties in:
bio\bio\Tests\Bio.TestAutomation\TestUtils
on a fresh clone, it would be great.

I will probably be working on the repository for the next few minutes. Right now a fresh clone is ~1 GB in size. About 40% of that is the Sandcastle data folder, which has a lot of XML files that I am not sure we can get rid of. Another 40% is the test data, and I am trying to change the parsers to use the zipped files where possible, which should drop us another ~100 MB in size.

The install itself is still incredibly small, but I think it might be nice to lower the total size of the source trunk as well.

-N
Coordinator
May 5, 2014 at 11:58 PM
My plan is to branch off this for 2.0, I'd like to dump the whole install tree and just have a trunk with the source code (buildable of course), then a second trunk with the install/doc bits. This would shrink the tree considerably for most people who don't care about building the documentation or installer.

mark
Developer
May 6, 2014 at 12:42 AM
Edited May 6, 2014 at 12:42 AM
Ah, okay didn't realize/remember that was the plan. Though completely agree that is a great idea.

I removed the largest bit of test data and am done mucking with the repository for now. At some point I will need to merge in the affine gap penalty changes to the aligner, but will wait for your branch changes first (no rush at all on my end, and I won't be able to do it for a bit).
Coordinator
May 8, 2014 at 1:16 PM
Nigel:

Go ahead and push your changes into the current tree - that way the fixes will be part of .NET Bio 1.1 (source anyway).

m
Developer
May 10, 2014 at 7:36 PM
Hey Mark,

I just merged the changes in as a SWAffineFix branch. I have now also merged that branch with the master one as well. Not sure what will be easier for grabbing 2.0.

There were a ton of incorrect unit tests that had to be fixed as part of this, which was something of a pain. I am not sure how much I trust the remaining tests, but hopefully this solves the issue.

Cheers,
Nigel
Coordinator
May 18, 2014 at 11:47 PM
One thing that will change slightly is the parser/formatter usage. We no longer have Open/Close directly on the parsers and formatters - instead, the underlying interfaces are all stream-oriented:
    public interface IParser
    {
        string Name { get; }
        string Description { get; }
        string SupportedFileTypes { get; }
    }

    public interface IParser<out T> : IParser
    {
        IEnumerable<T> Parse(Stream stream);    
    }

    public interface ISequenceParser : IParser<ISequence>
    {
        IAlphabet Alphabet { get; set; }
    }
The Open/Close/Dispose pattern is now part of a set of extension methods, created specifically for the platform; so the desktop variety looks like this:
    public static class SequenceParserExtensions
    {
        class DisposableParser : IDisposable { ... }

        public static IDisposable Open(this ISequenceParser parser, string filename);

        public static IEnumerable<ISequence> Parse(this ISequenceParser parser);

        public static void Close(this ISequenceParser parser);
    }
These allow for usage close to that of 1.x. Specifically, since the parser/formatter is not disposable anymore (it doesn't manage the file itself, since files are a platform-specific abstraction and not standardized), we can't do a "using" on the parser. So, we now return an IDisposable from the Open call, and that can be wrapped in a "using" instead. Here's an example usage:
    FastAParser parser = new FastAParser();
    using (parser.Open("4_raw.fasta"))
    {
        foreach (var row in parser.Parse())
        {
            Console.WriteLine(row);
        }
    }
I don't think it's a big deal - but I am letting everyone know that it will break existing code. The only other alternative I thought of was to have an IDisposable which raised an event, and then the extension methods (which manage the file) could catch that and close the file. I felt that was too odd and easily misunderstood, not to mention a potential memory leak. So this is what I came up with... comments?
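For anyone wondering how extension methods like these could track the open file when the parser itself no longer owns it, here is a minimal sketch. It is not the actual .NET Bio code: ILineParser, LineParser and LineParserExtensions are hypothetical stand-ins (ISequence is reduced to a plain string), and the use of a ConditionalWeakTable to map each parser to its stream is just one assumed way to do the bookkeeping.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Runtime.CompilerServices;

// Hypothetical, simplified stand-in for the stream-oriented parser interface.
public interface ILineParser
{
    IEnumerable<string> Parse(Stream stream);
}

// Hypothetical parser: yields each non-empty line of the stream.
public class LineParser : ILineParser
{
    public IEnumerable<string> Parse(Stream stream)
    {
        // The stream is owned by the caller, so the reader is not disposed here.
        var reader = new StreamReader(stream);
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Length > 0) yield return line;
        }
    }
}

public static class LineParserExtensions
{
    // Assumption: associate each parser with its open stream without
    // keeping the parser alive (keys are held weakly).
    private static readonly ConditionalWeakTable<ILineParser, Stream> OpenStreams =
        new ConditionalWeakTable<ILineParser, Stream>();

    // Returned from Open so callers can wrap the parser in a "using".
    private sealed class DisposableParser : IDisposable
    {
        private readonly ILineParser parser;
        public DisposableParser(ILineParser parser) { this.parser = parser; }
        public void Dispose() { parser.Close(); }
    }

    public static IDisposable Open(this ILineParser parser, string filename)
    {
        Stream existing;
        if (OpenStreams.TryGetValue(parser, out existing))
            throw new InvalidOperationException("Parser is already open.");
        OpenStreams.Add(parser, File.OpenRead(filename));
        return new DisposableParser(parser);
    }

    public static IEnumerable<string> Parse(this ILineParser parser)
    {
        Stream stream;
        if (!OpenStreams.TryGetValue(parser, out stream))
            throw new InvalidOperationException("Call Open before Parse.");
        // Delegates to the stream-based interface method.
        return parser.Parse(stream);
    }

    public static void Close(this ILineParser parser)
    {
        Stream stream;
        if (OpenStreams.TryGetValue(parser, out stream))
        {
            OpenStreams.Remove(parser);
            stream.Dispose();
        }
    }
}
```

With something like this in place, the calling pattern is the same as in the example usage: Open returns the IDisposable, the "using" block scopes the file handle, and Parse() with no arguments resolves to the extension method because the interface method requires a Stream.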
Coordinator
May 19, 2014 at 12:11 AM
Edited May 19, 2014 at 12:40 AM
I like this design, Mark.
The breakages are tolerable and the extension methods are a very clean way to implement a "parse file" capability without cluttering ISequenceParser implementations with (largely duplicated) file-oriented operations.
Coordinator
May 20, 2014 at 2:20 PM
Edited May 20, 2014 at 2:33 PM
Great! There are other changes coming as well. For example, the GZip support is now a generic class which wraps an existing parser. It works basically the same as it did before, but is now usable with any parser: I included default FastA/FastQ versions which can be used without defining anything, but any parser can now be wrapped. And since it implements ISequenceParser, you get the Open/Close extensions as well:
public class GZipFastAParser : GZipSequenceParser<FastAParser>
{
   // Convenience definition
}

public class GZipFastQParser : GZipSequenceParser<FastQParser>
{
   // Convenience definition
}

public class GZipSequenceParser<TP> : 
        GZipParser<TP,ISequence>, ISequenceParser
        where TP : class, ISequenceParser, new()
{
   public IAlphabet Alphabet { get ... set ... }
}

public class GZipParser<TP,T> : IParser<T> 
        where TP : class, IParser<T>, new()
{
   protected readonly TP Parser;

   public bool CanProcessFile(string filename);

   public IEnumerable<T> Parse(Stream stream);

   public string Name { get { return this.Parser.Name; } }
   public string Description { get { return this.Parser.Description; } }
   public string SupportedFileTypes { get { return this.Parser.SupportedFileTypes; } }
}
The new CanProcessFile method looks for the .zip/.gz extension, then backs up to the extension before it and walks through the wrapped parser's supported file extensions to ensure it is a valid one. This only validates the filename - not the contents.
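To make the filename check concrete, here is a rough sketch of that kind of validation. It is only an illustration, not the real implementation: GZipFileNameCheck is a hypothetical helper, and it assumes SupportedFileTypes is a comma-separated list of extensions such as ".fa,.fasta".

```csharp
using System;
using System.IO;
using System.Linq;

// Hypothetical helper illustrating a filename-only check.
public static class GZipFileNameCheck
{
    // supportedFileTypes is assumed to be comma-separated, e.g. ".fa,.fasta".
    public static bool CanProcessFile(string filename, string supportedFileTypes)
    {
        if (string.IsNullOrEmpty(filename) || string.IsNullOrEmpty(supportedFileTypes))
            return false;

        // Step 1: the outer extension must mark a compressed file.
        string outer = Path.GetExtension(filename);
        bool compressed =
            outer.Equals(".gz", StringComparison.OrdinalIgnoreCase) ||
            outer.Equals(".zip", StringComparison.OrdinalIgnoreCase);
        if (!compressed) return false;

        // Step 2: back up one extension ("reads.fasta.gz" -> ".fasta") and
        // compare it against the wrapped parser's supported extensions.
        string inner = Path.GetExtension(Path.GetFileNameWithoutExtension(filename));
        return supportedFileTypes
            .Split(',')
            .Select(ext => ext.Trim())
            .Any(ext => ext.Length > 0 &&
                        ext.Equals(inner, StringComparison.OrdinalIgnoreCase));
    }
}
```

Under these assumptions, "reads.fasta.gz" would pass for a FastA-style extension list, while "reads.fasta" (not compressed) and "reads.txt.gz" (unsupported inner extension) would not.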
Developer
May 20, 2014 at 8:43 PM
Hey Mark, I am back from travels and was going to commit the GenBank fix later. Given how the parsers are changing, should I hold off on this? Or branch from your fork and make a pull request?
Coordinator
May 20, 2014 at 8:55 PM
Go ahead and commit it to 1.x; I'll merge it in since I'm adding those parsers now.
Developer
May 25, 2014 at 6:42 PM
Genbank fix just added.