Performance Comparison?

Developer
Feb 5, 2013 at 4:56 AM
Out of curiosity, does anyone happen to have any comparisons of the relative performance of some of the algorithms in C# versus their equivalent C/C++ implementations? In particular for things like ABySS versus Padena, or the MUMmer/NUCmer tools?

I know the .NET Bio code I have seen is far more extensible/readable/modular/safe than the C/C++ code I have seen for a lot of bioinformatics stuff, so I gather it is going to take a performance hit for those features, beyond whatever is intrinsic to the virtual machine, for all that data validation and type safety. But if anyone has done any tests I would love to hear about them.

Similarly, has anyone ever tested .NET Bio on Mono? I have tried porting a few GUIs onto Linux/Apple using Mono and found the Mono performance was somewhere between bad and just plain broken. However, for things I have done that lead to console/command-line programs and don't use the Windows APIs, it has often been the case that Mono is faster than the MS CLR, so I am also curious how general that is.
Coordinator
Feb 5, 2013 at 5:29 PM
Performance comparison has been difficult for us for two reasons: when this was a Microsoft-internal project, there were legal issues that prevented us from downloading and using many of the standard assemblers and aligners, and hence we were not able to make direct performance comparisons. Now that we are an open-source, community-owned project, I have been hoping that others would perform these comparisons so they would be viewed as objective by the community.

Managed code is inherently at a disadvantage, of course - one which is in part compensated for by the safety, extensibility and flexibility you mention - but it is still slower. In most cases this is irrelevant, such as for parsers and web service connectors, but for demanding applications such as assembly I believe the best possible managed performance is about 3% slower than comparable unmanaged code. That said - I'm sure we are slower than that! We have had issues with performance versus scalability in PADENA - in fact the current codebase (the source, not the release) contains a fix which should enable vastly better scaling (from 2bn objects to 2bn * 2bn), but in my testing I have found that the impact on performance is so great that we will have to back these changes out. I have yet to test the version with those changes backed out, but I'm planning to do so in the coming weeks.

Overall, I'd say that any performance comparisons you are able to conduct would be a very welcome addition to this thread - or, given the importance of the issue, we can create a separate page on this site to discuss further. I have come to the conclusion that it is a case of using the right tool for the job; in its current form, PADENA is usable for regional assemblies and microbial-scale genomes, where scale is less of an issue compared to the advantages of modularity, customizability, etc. If speed is the prime concern then an unmanaged code solution might be better - ALLPATHS is highly regarded, or if you want an aligner then SNAP (written by my colleagues here at Microsoft in collaboration with the Berkeley AMP Lab) is an order of magnitude faster than anything else right now and can complete a human genome in about an hour. In fact the developers are shelving work on further speed optimizations because the bottleneck is now the speed at which the bus can deliver data to the CPU.

All the best,

Simon
Developer
Feb 5, 2013 at 8:20 PM
Hi Simon,

Thanks for the interesting reply! I had never heard of SNAP before, but an order of magnitude is a very impressive performance improvement. I also have to say that I am glad to see an aligner making good use of all the memory; it always seemed to me that a lot of the Burrows-Wheeler transform aligners were stopping themselves at around 4 GB of RAM and putting a large computational burden on themselves with this memory limit. Good to know that the intuition that increased memory usage can lead to improvements still holds.

As for performance, if I can find a Saturday somewhere I will see if I can post anything. I am using PADENA now to assemble some mitochondrial genomes, so I will probably be doing some comparisons soon anyway. One thing I will say though: .NET Bio definitively smokes Biopython (obvious I know, but really it is a couple of orders of magnitude). I am also a huge fan of the C#/IronPython integration and am really hoping enough people see the light and get to try it out sometime :).

As for the storage class issue you mention in Padena, I gather a linked list or something is being used instead of a normal list to allow for the squared improvement in the size of storage? Is the code for the new storage class online? I can take a look at the switch you mentioned; I'm not sure who is developing Padena now though.

Cheers,
Nigel
Coordinator
Feb 5, 2013 at 10:30 PM
Yup. We smoke BioJava too :-)

Of course the unmanaged apps are the real performance stars, and we have internal people working on this as well - for example FAST-LMM for large GWAS, and SNAP which I mentioned before. You make a good point about memory usage - 64GB is a lot to ask of the machines most people have on their desktops (today at least), but not much for any lab to afford. We have tested on 'super desktops' like this in the past.

The storage class solution we attempted was essentially the standard storage class in .NET, but taking a 'class of classes' approach - we knew we would sacrifice some speed and planned to implement a heuristic whereby, once the input was over a certain size, we would switch to the slower but more scalable solution; in the end, though, the performance hit was too great. Personally, I'm favoring the unmanaged code solutions of my colleagues (and of course of the wider community - I'd far rather see someone get the job done than insist on it being exclusively on Microsoft or any other platform).

I take it you found SNAP (snap.cs.berkeley.edu) - FAST-LMM is on CodePlex along with a range of other Microsoft Research apps - there's a link on the main page of this site.
Coordinator
Feb 6, 2013 at 1:36 PM
This is a great thread. The comparisons - as they stand - would make a great one-semester project. There was a recent BioJava paper which did have some performance numbers - I don't have the link handy but can dig it out. This and a few others would make rather a good set of test cases. That said, I'm not aware of a comprehensive set of benchmarks that is used across multiple communities. A set of basic problems is fairly easy to define, but there will always be some arguments over completeness. Still, if we can avoid religious wars it might be nice to identify a set of problems that could sensibly contribute to the debate. Happy to offer the project almost immediately.

Simon, I'm aware of some of the MUMmer-related problems, but could we start with a list of analyses you wanted to do but felt you couldn't because your hands were somewhat tied? My ideal test problems involve n vs. n-1 comparisons across whole genomes, but it's important to structure these so that we are not just measuring WS latencies.

Nigel, any thoughts on must-do tests? [FYI the victims would be 2nd or 3rd year CS majors. Good project with lots of tech and an interesting application.]

Cheers
Jh
Feb 6, 2013 at 2:25 PM
64 GB is still a lot even today, and Windows Azure does not offer this size; an Extra-Large instance only gives 16 GB. There is a benefit to constraining your program to use only 4 GB, because then I can grab 1,000 Windows Azure Large VM instances and run 1,000 copies of BWA.exe, each using 4 threads. Consider also that this 64 GB machine might only have 16 real cores. For a program to succeed, you not only need to excel on a single machine but also fit on a cluster node or a cloud VM.

The original FM-index paper actually compressed the BWT using MTF (move-to-front) coding, assuming extra CPU cycles are free in order to save space. The BWA paper instead saves portions of the O and S arrays and calculates the rest on the fly. It would be very easy for those programs to simply expand their memory footprint and run faster.
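To make that trade-off concrete, here is a toy sketch (in C#, since that is this project's language; illustrative only, not BWA's actual data layout) of a sampled occurrence array, where counts are stored only at checkpoints and the remainder is counted on the fly:

// Toy sketch of the space/speed trade-off only, not BWA's actual layout:
// store occurrence (O) counts for the BWT every SampleRate positions and
// count the remainder on the fly. A smaller SampleRate spends memory to
// answer queries faster; a larger one spends CPU cycles to save memory.
public sealed class SampledOcc
{
    private const int SampleRate = 128;
    private readonly byte[] _bwt;        // BWT, bases coded 0..3
    private readonly int[][] _samples;   // _samples[i] = counts of each base in _bwt[0..i*SampleRate)

    public SampledOcc(byte[] bwt)
    {
        _bwt = bwt;
        _samples = new int[bwt.Length / SampleRate + 1][];
        var counts = new int[4];
        for (int i = 0; i < bwt.Length; i++)
        {
            if (i % SampleRate == 0)
                _samples[i / SampleRate] = (int[])counts.Clone();
            counts[_bwt[i]]++;
        }
        if (_samples[bwt.Length / SampleRate] == null)
            _samples[bwt.Length / SampleRate] = (int[])counts.Clone();
    }

    // Occurrences of base b in _bwt[0..pos)
    public int Occ(byte b, int pos)
    {
        int checkpoint = pos / SampleRate;
        int count = _samples[checkpoint][b];
        for (int i = checkpoint * SampleRate; i < pos; i++)   // the on-the-fly part
            if (_bwt[i] == b)
                count++;
        return count;
    }
}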

Currently there are only two breeds of short-read aligner: you either use SA/BWT or hashing. Personally I feel doing hashing is going backwards in computer science research. To give you an extreme example (since I have been a VLDB fan for many years), I could grab a SQL Server with 1 TB of memory and hold the human reference genome in a big table, then do alignment with T-SQL statements, and it could be very, very fast, but what is the point? Who would be the user then? Designing software is a systematic activity: you need to consider not only your algorithm but also the hardware and the user.

Given that sequencers are emitting longer strings and people want sensitivity for indels/SVs, all current tools are imperfect in that they only tolerate a certain degree of gap opening or mismatch; beyond that, their speed decreases to the level of BLAST/Smith-Waterman. Thus trying to get ahead by indulging in relaxed memory constraints is kind of like cheating. The very first aligner, ELAND (part of CASAVA), was already using hashing.

Real progress is doing something new.
Developer
Feb 6, 2013 at 7:56 PM
Well, to be clear, no one is saying that the SA/BWT methods aren't a great innovation. I am not sure 64 GB is as much of a limitation as it used to be. Most clusters I use seem to have about 32 GB of RAM per node, going up to 192 GB. The new Mac Pro desktops come with up to 64 GB of RAM, and one can buy computers with far more memory than that for a lot less money at places like avadirect. It's just really hard to beat a 10X to 100X improvement, and I think it would not be too surprising if in the near future the standard cluster node memory size doubled to 64 GB. In any event, as you say, BWA could (I presume) speed up a lot by storing more of the O and S arrays and computing less on the fly. But I think saying hash methods are backwards is a little too rough. BLAST-esque hashing algorithms have improved a lot since they first came out, are still very useful and often better for particular tasks, and I think the people who worked on them should get some credit for making progress :).

As for comparisons, I can definitely think of a few and would be happy to help develop some tests that anyone might want to implement. The problem is that, as C#/Java span the domain between unmanaged and interpreted code, it's not clear to me that we would have a lot of other packages to compare to. Comparing to unmanaged code would be very interesting. I definitely think doing the MUMmer/NUCmer comparison and the assembly ones would be great. A comparison to Picard for BAM/SAM parsing and writing would also be very interesting; I can think of more if it looks like this will be implemented too.
Coordinator
Feb 7, 2013 at 12:04 AM
In regards to Mono - here is the information from our FAQ (yes, I know it's probably not intuitive to find):
http://bio.codeplex.com/wikipage?title=FAQ&referringTitle=Documentation&ANCHOR#home

In two places there is a call to CredentialCache.DefaultCredentials. This is only used for NTLM, negotiate, and Kerberos-based authentication, so it can be ignored when those aren’t being used. But if you need them, there isn’t really any good workaround.

There are four calls to Assembly.GetName(Boolean), which has the rather uninteresting ability to change the Assembly.CodeBase when an assembly is shadow-copied.

Another security related feature being used is HttpTransportSecurity.ClientCredentialType from Windows Communication Foundation. Since this is just used for calling web services via WCF, an alternate web service layer could be used until Mono catches up.

In one of the add-in packages there are a couple of calls to the Win32 function GetTickCount. This merely returns the number of milliseconds since the system was last started, so it is rather odd that Mono doesn’t already have a translation layer for the Linux and OS X equivalents.
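For what it's worth, a fully managed alternative exists and sidesteps the P/Invoke entirely - a minimal sketch, assuming all that is wanted is the uptime value:

// Sketch of a managed alternative to P/Invoking the Win32 GetTickCount:
// Environment.TickCount returns milliseconds since the system started on
// both the MS CLR and Mono (note it is a signed Int32, so it wraps after
// roughly 24.9 days).
using System;

static class Uptime
{
    public static int MillisecondsSinceBoot()
    {
        return Environment.TickCount;
    }
}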

Rick
Developer
Feb 20, 2013 at 12:14 AM
Edited Feb 20, 2013 at 12:27 AM
A quick note on a performance comparison I just did. I tested assembling a mitochondrial genome with either Velvet (C-coded, but the single-threaded version) or PADENA on Mono or .NET. This is not a complete evaluation by any means - different algorithms, settings, etc. - but the results were:

Velvet - 1 second to assemble
PADENA .NET - 3.5 seconds to assemble (-i flag)
PADENA .NET - 5 seconds to assemble
PADENA Mono - 11 seconds to assemble (note this was the 32-bit version on Windows)

I may draw this out more in the future, but for now am just verifying results.
Coordinator
Feb 25, 2013 at 2:09 PM
How big is the genome? Is this just 10^4-10^5 bp? Are we paying a set-up penalty here for the multiple threads but not getting any return? The 11-second result seems ignorable at this stage, but the 1s vs. 3.5 or 5s seems more interesting. Curious as to where you think it is spending the time.
Developer
Feb 26, 2013 at 8:13 PM
Edited Feb 26, 2013 at 8:14 PM
Hi Jim,

Yeah, I think the 11-second result is pretty ignorable. As for where the algorithm is spending its time: of the ~3-3.5 seconds it takes, it seems the bulk of the time is spent on creating the graph (1.7 seconds), undangling the graph (0.41 seconds), or building the contigs (~0.5 seconds).

My hunch is that the cost of starting threads from the thread pool probably isn't contributing that much; rather, it's some programming issues that might cause the difference. For example, without running the profiler, I think the link-generating function of the graph (finding k-mers that overlap between nodes of the graph) is taking far more time than needed. This function takes a node's k-mer encoded as an integer, decodes it to a string, creates a new string by appending a letter to a substring of that first string, converts that new string into a byte array and then encodes that byte array as an integer. Additionally, this process is performed twice, to make a connection from A->B and from B->A, so that introduces a 2X penalty for every link.

There seem to be quite a few ways to simplify this: for example, if the k-mer is kept as an integer, possible neighbors could be identified by a few simple bit-shift and integer AND/OR operations, saving a lot of overhead. Similarly, links between nodes could be made simultaneously to save time, and better use could be made of the slots for the A, C, G, T neighbors, which are not currently used in an ordering that identifies each slot as corresponding to a particular nucleotide. I am mucking around with PADENA enough that I could implement these changes if better performance is desirable, but I am not sure who else is working on it (Simon perhaps?) and there are at least a couple of things in PADENA that I don't understand yet (in particular the use of node orientations and allowing for even-numbered/possibly-palindromic k-mers).
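To make the bit-twiddling idea concrete, here is a minimal sketch, assuming a k-mer is packed two bits per base into a ulong; the encoding, class and method names are illustrative only, not the actual PADENA representation:

// Illustrative sketch only: assumes a k-mer is packed two bits per base
// (A=00, C=01, G=10, T=11) into the low 2*k bits of a ulong, with the
// most recently added base in the lowest bits.
using System.Collections.Generic;

public static class KmerNeighbors
{
    // The four possible "right" neighbors: drop the oldest base (via the
    // mask), shift everything up, and append each of A/C/G/T.
    public static IEnumerable<ulong> RightNeighbors(ulong kmer, int k)
    {
        ulong mask = (k >= 32) ? ulong.MaxValue : (1UL << (2 * k)) - 1;  // keeps only the low 2*k bits
        ulong shifted = (kmer << 2) & mask;
        for (ulong b = 0; b < 4; b++)
            yield return shifted | b;
    }

    // The four possible "left" neighbors: drop the newest base and
    // prepend each of A/C/G/T at the high end.
    public static IEnumerable<ulong> LeftNeighbors(ulong kmer, int k)
    {
        ulong shifted = kmer >> 2;
        for (ulong b = 0; b < 4; b++)
            yield return shifted | (b << (2 * (k - 1)));
    }
}

Compared to the decode-to-string round trip, each neighbor costs only a couple of integer operations, and both directions of a link can be derived from the same values.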

Would people be interested in trying to make PADENA more performant? My sense is any loss in the readability of the code could be made up for by good commenting.

Cheers,
Nigel
Developer
Feb 27, 2013 at 1:18 AM
So I thought it might be fun to try out some code improvements for PADENA, but got some strange results.

I took a look at changing the GenerateLink function in PADENA so that it keeps working in the space of sequences encoded as ulongs when looking for neighbors rather than moving out to strings/byte arrays/etc. and back. I actually think this wound up making the code a little bit easier to read.

When examining the total CPU time reported by PADENA, it drops from ~24 seconds to ~9 seconds, which is a ~2.7X speed improvement. However, the whole thing is a bit strange to me because the .NET Bio 1.01 binaries that I have installed outperform the version I just compiled (even before making changes) by a pretty significant amount. I am not entirely sure what the differences are. Does anyone know if the current PADENA source is slower than the previous version? Is the distributed version ngen'ed or something?
Developer
Feb 27, 2013 at 1:20 AM
Simon, apologies I just remembered your statement about the PADENA container! Is the source going to be updated to reflect the reversion? I am working with PADENA now so would like to keep it consistent if possible.

Cheers,
N
Coordinator
Feb 27, 2013 at 1:58 PM
Hi,

This is very likely due to a change that was made post-1.01 to attempt to increase the number of nodes that Padena could create in the graph. In 1.01 it uses a simple array to track the nodes; this was then changed to use a BigList<T>, which is a sparse array (an array of arrays), but in performance testing Simon found that it had a significant effect on the time it takes to build the graph (>20%). That change was backed out and I think it was committed to the current trunk, but you might check to make sure your source version doesn't have that code in it. It would certainly explain the difference.

mark

Mark Smith



Developer
Feb 28, 2013 at 6:43 PM
Okay, I gather the changes on 6/8/2012 that manjus made were the ones that introduced the jagged array mechanism, and this does appear to still be part of the source code. The 2 GB object limit is a bit frustrating, but did anyone confirm that the jagged arrays were the issue? This piece of code also seems to introduce an AA tree to replace an older binary tree, which seems like an improvement, so I don't know if the change should simply be removed outright or not; plus the BigList seems like a generally useful thing to have around. I think I may try to generally clean up and fix these PADENA issues - I will let you know if/how performance is affected.
Coordinator
Mar 1, 2013 at 7:27 PM
Yes, that's it. I removed it for some comparison tests in a private build but didn't check the change in. All I did was remove the use of BigList<T> (replacing it with a standard array type) in all the Padena code, since I wasn't quite familiar with the other changes Manju was making, which were specific to a project he was involved in. I used the prior version to guide me - I think if you do the same in your code you'll find the performance goes back up, at the expense (possibly) of running out of memory when large graphs are generated. I'm actually not even certain how necessary his change is now that .NET 4.5 lifts some of the memory pressure (in 64-bit apps you can now allocate objects > 2GB in size, although arrays are still limited to 2G elements since the index is an integer). He was trying to correct OutOfMemoryException issues with very large data sets, and if the OOM was a result of overall object size (i.e. new long[Int32.MaxValue]) then .NET 4.5 fixes that with no code changes. However, if it was a number-of-elements issue, then it will still occur without the change.
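For anyone following along, the shape of the structure being discussed is roughly this - an illustrative sketch only, not the actual BigList<T> source:

// Illustrative "array of arrays" list in the spirit of BigList<T>; not the
// actual .NET Bio implementation. A single T[] is limited to Int32.MaxValue
// elements (and, without the .NET 4.5 gcAllowVeryLargeObjects setting, to
// 2 GB total size), so items are spread across fixed-size blocks and
// addressed with a 64-bit index.
using System.Collections.Generic;

public sealed class JaggedList<T>
{
    private const int BlockSize = 1 << 24;               // ~16M elements per block
    private readonly List<T[]> _blocks = new List<T[]>();
    public long Count { get; private set; }

    public void Add(T item)
    {
        if (Count % BlockSize == 0)
            _blocks.Add(new T[BlockSize]);               // grow one block at a time
        _blocks[(int)(Count / BlockSize)][Count % BlockSize] = item;
        Count++;
    }

    public T this[long index]
    {
        get { return _blocks[(int)(index / BlockSize)][index % BlockSize]; }
        set { _blocks[(int)(index / BlockSize)][index % BlockSize] = value; }
    }
}

The extra indirection on every element access is presumably where the >20% graph-build slowdown came from.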

mark


Developer
Mar 1, 2013 at 9:23 PM
I played around a bit with backing out some of the changes he made. The code struck me as a bit odd, since Manjus seemed to keep an AA tree to search for the data while simultaneously keeping the data in a list (either an array or the BigList<> class). This seemed a little redundant until I tried to use just the AA tree as the data structure for the nodes/k-mers and saw how slow enumerating the nodes in the tree structure was. Yowza, I had no idea how slow enumerating through binary trees can be relative to arrays in .NET. I wrote up the quick comparison I did here in case anyone is interested (or can point out an improvement).

The graph-building stage is actually pretty fast now, and I gather from the profiler that the dangling-link purging, contig building, etc. are slowed only by enumerating the nodes in the graph. I might see if we can just do less of that, or what the effect is of moving all the nodes into a list once the search operations are over.

I suspect we could get over 2bn objects and keep the speed by doing a nested enumeration of arrays, where a class hands out an array and the nodes in that array are processed in parallel (so handing out arrays of up to 2bn objects for something else to iterate over, rather than having one IEnumerable that hands out all of the objects from all of the arrays).
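Roughly what I have in mind is something like this (a sketch only; the type and member names are made up for illustration, not the actual graph classes):

// Hand out whole blocks (arrays) sequentially, then process the nodes in
// each block in parallel, instead of exposing one flat IEnumerable over
// every node in every array. Names are placeholders for the example.
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BlockedNodeEnumerator
{
    public static void ForEachNode<TNode>(IEnumerable<TNode[]> blocks, Action<TNode> body)
    {
        foreach (TNode[] block in blocks)      // sequential over blocks
        {
            Parallel.ForEach(block, body);     // parallel within a block
        }
    }
}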

-N
Coordinator
Mar 18, 2013 at 12:30 PM
evolvedmicrobe wrote:
Simon, apologies I just remembered your statement about the PADENA container! Is the source going to be updated to reflect the reversion? I am working with PADENA now so would like to keep it consistent if possible.

Cheers,
N
Nigel,

Sorry for being incommunicative for the past few weeks - pressure of work on other projects. I'm not sure what my comments were regarding a container, but perhaps it is irrelevant now? If not, can you elaborate?

Simon
Developer
Mar 18, 2013 at 7:24 PM
Hi Simon - the comments are somewhat irrelevant now, I think. They were in regards to the different ways reads are stored between v1.01, the current version in the source, and the one I uploaded as a shelveset. More comments about it are in the graph-building section of the PADENA improvements issue I created (8609).
Coordinator
Mar 20, 2013 at 12:35 PM
I would attach a spreadsheet of performance testing but CodePlex doesn't support attachments in the forums. I'm sure there is a great reason for this lack of useful functionality. I'll add some tables below.

Note the datasets for testing were very small compared to the real world, but while we are resolving the issues caused by implementing (and then removing) more scalable storage classes, these tests focus more on speed than scalability.

The data in the spreadsheet does not paint an unambiguous picture, but I would draw the following conclusions:
• Exact timings vary from run to run, which I ascribe to other workloads on the test machine.
• We see a slight variation in the number of contigs produced between versions. This is worrying, though it looks like the 1.01 version produces a very slightly different figure from the subsequent two versions, and only in one of the test datasets.
• Overall, it appears that the version with the scalable storage classes is slower than 1.01 on .NET 4.0.
• Broadly speaking, all three versions have comparable speed on .NET 4.5 where the datasets are small; the modified storage classes version hits a massive slowdown at 1M reads.
• Memory usage is much reduced for large data sizes once the storage class changes are backed out.
• The version of .NET (4.0/4.5) does not appear to have a significant effect on performance.

I plan to do further testing on different hardware, but I believe this configuration (see below for details) is giving us poorer than usual performance - I also ran some tests on Win 7 (x64) with a quad-core processor and was seeing far better times. The value of these figures, however, is that the hardware is constant, so we can do comparisons between versions.

In conclusion - more testing is needed, especially with larger datasets. These tests should be conducted once the latest shelvesets of Padena modifications have been checked in. As Nigel has said on another thread, his changes should result in more than a 2X speedup.

With .NET 4.0
                                  Total (seconds)       Peak memory (GB)                    
# reads (100Ks) # contigs         1.01        1.01a     1.01                   1.01a          
 1                 1322            749          884     (not recorded)          2.73          
 2                 3325           3070         5448     (not recorded)          5.46          
 3                14414*          4601         5116     (not recorded)          7.85          
 4                13861          10543         9078     (not recorded)         10.48          
 5                20734          13500        13620     (not recorded)         12.56          
 10               61692          77038        98330     (not recorded)         14.82          

* 14415 with 1.01a
With .NET 4.5
                                 Total (seconds)            Peak memory (GB)                    
# reads (100Ks)  # contigs     1.01   1.01a   1.01b     1.01              1.01a   1.01b
 1                  1322        628     774     843     (not recorded)     2.65    2.66
 2                  3325       4669    5632    5770     (not recorded)     5.24    5.21
 3                 14415       9756    6858    8030     (not recorded)     7.37    7.39
 4                 13861      10601    9953    9308     (not recorded)     9.93    9.96
 5                 20734*     15921   13745   11886     (not recorded)    12.06   10.71
 10                61692      68791  131554             (not recorded)    14.81          

* 20736 in 1.01
  
Test conditions:
I took a real 3.7M read dataset and used it to create sets of reads, each of which contained the reads from the previous set plus an increment. I created sets from 100K reads to 3.7M, but used only 100K – 500K plus 1M due to the time taken to perform the assemblies.

I used three different versions of PadenaUtil, giving them the following names:
1.01 - the version shipping with the current .NET Bio release
1.01a - the version with the scalable but slow storage classes
1.01b - a version with the slow but scalable storage class changes backed out

I ran 1.01 and 1.01a with .NET 4.0 and 4.5, and 1.01b with 4.5 only. This enables comparisons between versions of padenautil and also a comparison between .NET versions.

In each case, I used the command line:
padenautil assemble -k:32 -l -c:2 input.fa >output

All runs were conducted on hardware with the following spec:
• 2xQuad-Xeon (E54xx) processors
• 16GB RAM
• 1Gbit Ethernet adapter, up to 1.7TB attached storage
• Windows Server 2008R2 (SP1)
• By default the machine has .NET 4.0 installed – where needed I upgraded to 4.5.

I know this is not the recommended configuration, and indeed there is some evidence to suggest we run more slowly on this configuration, but it was available for the two-month period it took to complete these tests, and standard across all runs.
Coordinator
May 21, 2013 at 3:57 PM
I did some performance testing on the 1.1 Alpha version of PadenaUtil, and found it to be up to 60% faster than the previous version, which is great!

Unfortunately the performance advantage is concentrated on the early steps of the algorithm which scale relatively well with input data size - the graph undangling phase would be the next challenge for improvement.

Details below.

Test conditions:
I took a real 3.7M read dataset and used it to create sets of reads, each of which contained the reads from the previous set plus an increment. I created sets from 100K reads to 500K for this test.

For the 1.01 (current) release, I used the command line:
padenautil assemble -k:32 -l -c:2 input.fa >output

The 1.1 Alpha requires k to be an odd number, to prevent the case where palindromic sequences generate infinite loops (with an odd k, a k-mer can never be its own reverse complement), so I used k=31:
padenautil assemble -k:31 -l -c:2 input.fa >output

Runs were conducted on hardware with the following spec:
2x Quad-Xeon CPU W3520 @ 2.66GHz
6GB RAM
Windows 7 Enterprise SP1, 64-bit
.NET 4.5


Performance
                                    Total (seconds)
#reads (100Ks)  #contigs         1.01        1.1 Alpha   Improvement
 1                 1322           749         135          64%
 2                 3325          3070         320          60%
 3                14414          4601         444          58%
 4                13861         10543         764          45%
 5                20734         13500        1450          17%
Developer
May 21, 2013 at 5:13 PM
Hi Simon,

Thanks for checking out the performance improvements! Quick question: looking at the total (seconds) taken by the different programs, comparing 13,500 seconds to 1,450 seconds means the 1.1 version runs in about 10% of the time, but it is listed as only 17% improved - can I ask how the improvement was calculated?

Indeed, virtually all of the improvements were concentrated in the initial graph-building stage (which for my data was the bottleneck). There is one slight improvement at the last contig-building step, but that won't be noticed unless the contigs assembled are long. One annoyance of the algorithm right now is that if a path goes from node A to node B, we trace it in both the A->B direction and the B->A direction, and then simply discard one path at the end. I considered modifying the program a bit to avoid this redundant work, but never got around to it - plenty of room for further improvement though!

Out of curiosity, what is the N50 of the contigs generated? I know the testing has previously been done with k-mers of size ~32. However, at error rates of 0.1-2%, this means that a high fraction of k-mers will be erroneous (relative to what you get around a k-mer size of ~19). For prokaryotic genomes, I think the 17-22 bp size range might be the sweet spot, and I might recommend we standardize on something similar for future tests (it sounds like such parameters are being discussed with Jim now).
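As a quick back-of-the-envelope check (assuming independent per-base errors, so the error-free fraction of k-mers is (1 - e)^k):

// Fraction of k-mers expected to be error-free is (1 - e)^k under
// independent per-base errors.
using System;

class KmerErrorCheck
{
    static void Main()
    {
        foreach (double e in new[] { 0.001, 0.01, 0.02 })
            foreach (int k in new[] { 19, 32 })
                Console.WriteLine("e={0:P1}, k={1}: {2:P0} of k-mers error-free",
                                  e, k, Math.Pow(1 - e, k));
        // e.g. at e=1%: k=32 gives ~72% error-free vs. ~83% at k=19;
        //      at e=2%: k=32 gives ~52% vs. ~68% at k=19.
    }
}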

Warm wishes,
Nigel
Coordinator
May 21, 2013 at 9:35 PM
You ask how I calculated the performance improvement? There's a simple answer:

Wrongly. Oops :-)

I am frankly unsure where I got those last values from - I am assuming a copy-and-paste error. Here are the correct ones (improvement is calculated as (1.01 time - 1.1 Alpha time) / 1.01 time):
Reads (100Ks)   1.1a    1.01    Perf improvement
1               135      377       64%
2               320      800       60%
3               444     1058       58%
4               764     1379       45%
5              1450     1740       17%
As for N50 - coming soon...
Developer
May 22, 2013 at 10:23 PM
Ah okay, that makes sense. I was expecting to see the program run in less than half the time now, and that seems to be the case. It's not entirely clear to me what is going on with the big drop-off at 500K reads; it might be good to discuss more benchmarking per the other thread. Do you know what genome this is? I would be curious to see, in addition to N50, how the length of the different steps varies between the two versions.