padenautil parallelization

Feb 22, 2012 at 7:54 PM
Edited Mar 5, 2012 at 7:55 PM

Hey,

I just started playing around with Padena. The Technical Guide states that Steps 1 and 2 (graph construction) are parallelized, but I can't see this in Task Manager. PadenaUtil.exe prints this:

Step1&2: Create Kmer and Graph - Start time: 2/22/2012 11:31:34 AM

315947 sequence(s) processed.
315947 sequence(s) processed.
315947 sequence(s) processed.
315947 sequence(s) processed.

The number of threads stays between 6 and 7, and barely one CPU is used.

What am I missing here?

Thanx

Jan

Feb 22, 2012 at 11:07 PM

Not all steps of PADENA are parallelized, but de Bruijn graph construction is.

We use the Task Parallel Library (TPL), which is the more sophisticated of the Parallel Extensions methods (the other being parallel for/foreach). TPL looks at workload and processor availability before scheduling threads on multiple processors, so not all workloads will result in the pattern you are looking for in Task Manager.

All I can suggest is to load your machine with other tasks or try bigger workloads, then you will see more activity.

Simon

Feb 23, 2012 at 5:03 AM
Edited Feb 23, 2012 at 5:07 AM

Here's a little more info. First, a disclaimer: I am not the author of the Padena code and do not know all the ins and outs of it, but I have gone through the code base in the past to understand what it was doing. Here's my recollection.

As Simon mentioned, Padena uses the Task Parallel Library (TPL). This is an abstraction over threads which provides some intelligence for managing the parallelization of work. When Padena is creating the de Bruijn graph and k-mers, it creates several synchronized tasks: one creates the k-mers and another consumes them to produce the graph nodes and edges. You can see this code in the DeBruijnGraph.cs file in the Bio project: (http://bio.codeplex.com/SourceControl/changeset/view/72549#1267529). The work I'm referring to is in the Build() method. The work is done in parallel, but only on two tasks. So the graph is built as the k-mers are constructed; however, the list of passed sequences is processed sequentially, i.e. a single task generates the k-mer list.

From Tasks to threads: the TPL can spin up a thread for each generated Task (or use the thread pool), or it can consolidate tasks and run them synchronously on the same set of threads (this happens if the task work is really short, or when the number of cores on the machine is low). The decision is based on runtime heuristics (how many threads are running, memory, processor count, etc.). In this case, I would expect two separate threads to be created, running in lockstep with each other: the task performing the graph construction will block waiting on data, and the k-mer creation task can pause if the queue gets too big (i.e. the graph thread slows down). Again, you can see this in the code I cited above. Given that the threads can block, I wouldn't expect 100% CPU utilization, and on a box with more than 2 cores I would never expect to see anything close to 100% for this.
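
For illustration, the two-task pattern described above can be sketched as follows. This is Python rather than C#, and it is not the Padena code: the names `produce_kmers`, `build_graph`, `build` and `max_pending` are made up for this sketch, and counting k-mers in a dictionary stands in for real graph construction. A bounded queue gives the behavior described: the consumer blocks while the queue is empty, and the producer blocks when the queue is full. (In .NET, a bounded `BlockingCollection<T>` would play the same role; whether Padena uses that type is an assumption here.)

```python
import queue
import threading

SENTINEL = None  # marks the end of the k-mer stream (hypothetical convention)


def produce_kmers(sequences, k, out_queue):
    """Producer: walk the sequences one by one and enqueue every k-mer."""
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            out_queue.put(seq[i:i + k])  # blocks while the queue is full
    out_queue.put(SENTINEL)


def build_graph(in_queue, graph):
    """Consumer: count each k-mer (a stand-in for adding graph nodes)."""
    while True:
        kmer = in_queue.get()  # blocks while the queue is empty
        if kmer is SENTINEL:
            break
        graph[kmer] = graph.get(kmer, 0) + 1


def build(sequences, k, max_pending=1000):
    """Run one producer and one consumer thread; return the k-mer counts."""
    pending = queue.Queue(maxsize=max_pending)
    graph = {}
    producer = threading.Thread(target=produce_kmers, args=(sequences, k, pending))
    consumer = threading.Thread(target=build_graph, args=(pending, graph))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return graph
```

Calling `build(["ACGTACGT", "CGTA"], k=3)` uses exactly two worker threads no matter how many cores are available, and either thread may sit idle waiting on the other, which is consistent with the low CPU usage seen in Task Manager.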

You didn't mention your system specs, or how you ran the program (command line, VS.NET, etc.), but I assume you have more than 2 cores given the CPU peaks you hit. Also keep in mind that other work is being done, since you see 6-7 threads going: one is the primary thread and one is the finalizer, so we have 2 other threads doing some work, or waiting for work, at the point where you examined Task Manager.

Other aspects of Padena are also performed in parallel. In almost all cases Tasks were used, so if you search the code base for Task.Factory.StartNew you should find most of them.

I hope that helps!

mark

 

Feb 24, 2012 at 6:28 PM

Thanks for the answer. I have 2 Xeon 8-core processors at 2.93 GHz and 96 GB RAM. I'm currently running a data set with 8033554 sequences, and I could see the parallelization on all cores in the later stages of Padena. However, it has already been running for almost 24 hours and hasn't finished yet.

E:\apps\bionet>PadenaUtil.exe Assemble -a -o:e:\apps\contigs.fa  e:\apps\62F6HAAXX.1.1_filtered.fasta
Padena Utility v1.0
Copyright (c) 2011, The Outercurve Foundation.


Initializing - Start time: 2/23/2012 11:10:30 AM

Initializing - End time: 2/23/2012 11:11:02 AM
Step1&2: Create Kmer and Graph - Start time: 2/23/2012 11:11:02 AM

1548649 sequence(s) processed.
2752802 sequence(s) processed.
3650659 sequence(s) processed.
4848741 sequence(s) processed.
6003474 sequence(s) processed.
7001716 sequence(s) processed.
7638258 sequence(s) processed.
Processed total 8033554 sequencecs.
    Graph built successfully.
    GenerateLinks Started.
........................................
    Generate Links Ended......................................................................................................................................................
Step1&2: Create Kmer and Graph - End time: 2/24/2012 10:02:49 AM
Estimating default values - Start time: 2/24/2012 10:02:49 AM

Estimating default values - End time: 2/24/2012 10:02:49 AM
Step3: UndangleGraph - Start time: 2/24/2012 10:02:49 AM
.

 

Feb 27, 2012 at 5:21 PM

Jan, just checking in with you. Did the process complete? I think this last post indicated it wasn't finished yet.

A long run is not unreasonable if the dataset is large enough. It would be good to know what the data size is, how long the reads are, what coverage depth it should represent, and how many reads there are, and also whether Ns have been screened out of the reads, which is an important QC step to get good results.
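
A minimal sketch of how those figures could be collected from a FASTA file is shown below. It is written in Python and is not part of .NET Bio; `read_fasta`, `fasta_stats` and the `genome_size` parameter are hypothetical names for this example. Coverage is estimated as total bases divided by the expected genome size, which has to be supplied by the user.

```python
def read_fasta(handle):
    """Yield each sequence in a FASTA stream as one upper-case string."""
    chunks, in_record = [], False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if in_record:
                yield "".join(chunks).upper()
            chunks, in_record = [], True
        elif line and in_record:
            chunks.append(line)
    if in_record:
        yield "".join(chunks).upper()


def fasta_stats(handle, genome_size=None):
    """Summarize read count, read lengths, reads containing N, and coverage."""
    lengths = []
    reads_with_n = 0
    for seq in read_fasta(handle):
        lengths.append(len(seq))
        if "N" in seq:
            reads_with_n += 1
    total = sum(lengths)
    return {
        "reads": len(lengths),
        "total_bases": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "reads_with_n": reads_with_n,
        # Estimated coverage depth; None when no genome size was given.
        "coverage": total / genome_size if genome_size else None,
    }
```

It can be used as `with open("reads.fasta") as f: print(fasta_stats(f, genome_size=5_000_000))`, where both the file name and the genome size are placeholders.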

Feb 27, 2012 at 5:57 PM

PadenaUtil crashed after 269 hours of CPU time, somewhere in Step 3. It's actually not a very large test data set; it is from a bacterium, and the Ns were removed with the Bio.Net filter util. I will check the exact numbers. I will start the process again with a debug version of PadenaUtil that gives me the exception.

Feb 29, 2012 at 1:41 AM

OK, here is the exception that came up.

    Graph built successfully.
    GenerateLinks Started.
One or more errors occurred.
   at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Threading.Tasks.Task.Wait()
   at System.Threading.Tasks.Parallel.PartitionerForEachWorker[TSource,TLocal](Partitioner`1 source, ParallelOptions parallelOptions, Action`1 simpleBody, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEachWorker[TSource,TLocal](IEnumerable`1 source, ParallelOptions parallelOptions, Action`1 body, Action`2 bodyWithState, Action`3 bodyWithStateAndIndex, Func`4 bodyWithStateAndLocal, Func`5 bodyWithEverything, Func`1 localInit, Action`1 localFinally)
   at System.Threading.Tasks.Parallel.ForEach[TSource](IEnumerable`1 source, Action`1 body)
   at Bio.Algorithms.Assembly.Graph.DeBruijnGraph.GenerateLinks()
   at Bio.Algorithms.Assembly.Padena.ParallelDeNovoAssembler.Assemble(IEnumerable`1 inputSequences)
   at Padena.AssembleArguments.AssembleSequences()
   at Padena.Program.Assemble(String[] args)
Value cannot be null.
Parameter name: kmer
   at Bio.Algorithms.Kmer.KmerData32.CompareTo(IKmerData kmer)
   at Bio.Algorithms.Assembly.Graph.DeBruijnGraph.SearchTree(IKmerData kmerValue)
   at Bio.Algorithms.Assembly.Graph.DeBruijnGraph.<GenerateLinks>b__c(DeBruijnNode node)
   at System.Threading.Tasks.Parallel.<>c__DisplayClass32`2.<PartitionerForEachWorker>b__30()
   at System.Threading.Tasks.Task.InnerInvokeWithArg(Task childTask)
   at System.Threading.Tasks.Task.<>c__DisplayClass7.<ExecuteSelfReplicating>b__6(Object )

E:\apps\bionet>

Feb 29, 2012 at 6:32 PM

Jan, this looks like a bug to me. Can you file a bug under the issue tracker tab? Presumably you can include the data you used so we can replicate the problem and try to help resolve it. If you uncover the problem yourself and get it fixed, I can help with the steps for getting the fix contributed back to the project. We appreciate you taking the time to do this and using .NET Bio.

Rick for the .NET Bio team

Mar 2, 2012 at 4:10 PM

Jan,

I'm working on the same thing with the Padena assembler. I'm using the cluster manager from HPC Pack 2008 R2. This job manager has a tool named Charts and Reports; in its Cluster Utilization section you can see the real utilization of your cores. Let me know if this information is useful to you. Maybe we can share our results, because I don't know of anyone else who is working with this tool for parallelized processing.

Greetings,

@MontesLeonardo

Mar 2, 2012 at 11:23 PM

Besides the crash, I found the Padena assembler super slow. I thought one of the main purposes of Padena was speed. Generating contigs from the same data sets takes minutes with other assemblers.

Mar 5, 2012 at 5:28 PM

Hi Jan,

As far as we know, PadenaUtil should not be as slow as you are finding it. What would be very useful for us is some comparative figures giving the characteristics of the dataset (size in Kbytes, read lengths, number of reads), with the time taken on PadenaUtil and the time taken on another comparable assembler (ideally de Bruijn graph-based and running on a single machine, not a cluster or cloud). We did some benchmarking prior to release, but were restricted by the licenses of some assemblers from benchmarking against a wide range; where we did test, though, we were comparable.

In the event your benchmark shows a big difference and you are able to share your dataset with us, it would help us with addressing any performance issue.
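
One simple way to collect comparable timings is to wrap each assembler's command line in the same small harness. The sketch below is Python, not part of .NET Bio, and `time_command` is a hypothetical helper; it reports wall-clock time only, not CPU time.

```python
import subprocess
import sys
import time


def time_command(command):
    """Run a command to completion and return (exit_code, elapsed_seconds)."""
    start = time.perf_counter()
    completed = subprocess.run(command)
    return completed.returncode, time.perf_counter() - start


if __name__ == "__main__":
    # Harmless demo: time a no-op Python process. Replace the list with the
    # assembler command to measure, for example
    # ["PadenaUtil.exe", "Assemble", "-a", "-o:contigs.fa", "reads.fasta"].
    code, seconds = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, wall-clock time {seconds:.1f} s")
```

Running every assembler through the same wrapper on the same machine and input keeps the figures directly comparable.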

Thanks,

Simon

 

Mar 5, 2012 at 5:53 PM

Jan, I saw the issue you filed, and yes, we would need the dataset to reproduce the problem here.

You can send it to my Microsoft email address (a-rickbe at microsoft dot com). You should have it from when you attended our course, but if you have trouble reaching me, let me know. We want to get to the bottom of this one.

Also, we are working on .NET Bio 1.01, and if this is easily identified and hopefully easily fixed, perhaps we can get the fix included and you would have it shortly. No promises, but for timing purposes, getting the data to me quickly will increase the likelihood of a fix.

Rick for the .NET Bio team

Mar 5, 2012 at 7:55 PM

I can share the data set for testing purposes, but since the data are not published, please do not put them anywhere public. Thanks for investigating this issue.

Rick, I will share the data via SkyDrive with your Microsoft e-mail address.