Use of PADENA in a distributed memory environment

Jan 24, 2012 at 11:30 PM

Hello everybody,

 

I'm using a Windows cluster running Windows Server 2008 R2 HPC Edition, with HPC Pack for MPI. I was trying to run PADENA on the cluster with a single FASTA sequence of 1.3 MB, but I don't know how. Now I have an 8 GB FASTQ (.fq) sequence file that I want to assemble. Here are my questions:

1. Is PADENA only for FASTA files?

2. Where can I get a free, high-quality format parser?

3. Does PADENA have a size limit?

4. Can I use PADENA with a distributed memory model? Where can I find documentation, or how do I set it up in a Windows environment?

5. Does anybody know how PADENA performs compared with ABySS or other sequence assemblers?

 

greetings,

@MontesLeonardo

Coordinator
Jan 25, 2012 at 4:22 PM

Hi Leonardo,

To answer your questions above:

1. PadenaUtil currently only reads files in the FASTA format. PADENA is the algorithm in the .NET Bio library, and PadenaUtil is the small command-line application we supply to demonstrate how a programmer can use PADENA. PadenaUtil can easily be adapted to read any other DNA sequence file format supported by .NET Bio (for example GenBank, FASTQ, BAM and SAM) - this is already in the database as a feature request.

2. A file format converter is available in the SDK/Tools/ directory (look in the subdirectory FileFormatConverter). This is a sample app you will have to compile, but should meet your needs. Not sure what you mean about 'high quality' - the conversion between FASTQ and FASTA format will simply remove quality values from the file. If you would like to remove regions of low quality before making your file conversion, take a look at the Sequence Quality Control Studio (SeqCOS) application, which is built on top of .NET Bio and also available on CodePlex.

3. PADENA (and therefore PadenaUtil) does have a size limit - it can handle up to 2 billion different sequences, and it can handle any single sequence with up to 2 billion bases. Exceeding either of these limits will generate an error. Which limit you hit first depends on the data: the current version of PADENA will not handle a single very long sequence (over 2 Gbp) or a dataset of more than 2 billion sequences. Given that you are using an 8 GB FASTQ file, which will be much smaller after conversion to FASTA, you will probably not encounter either limit.

4. PadenaUtil will run on your configuration, but is not able to take advantage of distributed memory, MPI or HPC features. The current implementation of PADENA is designed for parallel execution on a single machine, so if you have a computer with many processors or cores, it will (as far as possible) divide the work between them to run more quickly. This version is not adapted to take advantage of an HPC cluster, though, so you will see no advantage in using this hardware (it will still run).

5. PADENA was developed while .NET Bio was a project inside Microsoft, and for legal reasons it was not possible for us to have access to the full range of bioinformatics tools available to academics. Consequently, the range of applications we were able to benchmark against was restricted. It would therefore be better if someone from the academic community were able to answer your question regarding relative performance.
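
To make answers 2 and 3 concrete, here is a minimal Python sketch of what a FASTQ-to-FASTA conversion does (it drops the quality lines) and of a check against the two size limits. It is not the FileFormatConverter code and does not use the .NET Bio API; the function names are hypothetical, the input is assumed to use the common four-line FASTQ layout, and "2 billion" is assumed here to mean the signed 32-bit maximum.

```python
import io

# Assumption: "2 billion" is taken to be the signed 32-bit maximum.
ASSUMED_LIMIT = 2**31 - 1


def fastq_to_fasta(fastq, fasta):
    """Copy records from a FASTQ stream to a FASTA stream, dropping qualities.

    Assumes four lines per record: header, bases, '+', qualities.
    Returns (number_of_sequences, longest_sequence_length).
    """
    count = 0
    longest = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        bases = fastq.readline().strip()
        plus = fastq.readline()
        fastq.readline()  # quality line, read and discarded
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near sequence {count + 1}")
        fasta.write(">" + header[1:].strip() + "\n" + bases + "\n")
        count += 1
        longest = max(longest, len(bases))
    return count, longest


def within_assumed_limits(count, longest, limit=ASSUMED_LIMIT):
    """True if neither the sequence count nor the longest sequence exceeds limit."""
    return count <= limit and longest <= limit


# Tiny in-memory example; real files would be opened with open(path).
sample = "@read1\nACGT\n+\nIIII\n@read2 extra\nGGCCTA\n+\nIIIIII\n"
out = io.StringIO()
n, longest = fastq_to_fasta(io.StringIO(sample), out)
print(out.getvalue(), end="")
print(n, longest, within_assumed_limits(n, longest))
```

Because the sketch reads one record at a time, memory use stays flat regardless of file size, which matters for an 8 GB input.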

Simon

 

Jan 25, 2012 at 5:36 PM

Thanks for your answer, Simon. My priority was to run PADENA on an HPC cluster with 3 compute nodes and 1 head node using HPC Pack. Can you tell me whether any tool exists for doing similar work in a Microsoft environment? Also, could you point me to where the PADENA paper, or any other document, states that PADENA is not designed for distributed memory or HPC clusters (for now)? The reason is that I want to test and document all the .NET Bio tools, because after that I want to start developing my own assembler using .NET Bio.


Thanks again for your valuable response, and please forgive my English,

@MontesLeonardo

Coordinator
Jan 25, 2012 at 11:01 PM

Writing an efficient HPC implementation is not trivial. I would advise you to look carefully at other applications and understand MPI and related technologies first - I am not an expert in this area, sorry. You may want to look at the code for SampleCluserApp in the Tools directory of the SDK, for example.

The PADENA paper is available from this site under the Documents tab. As a scientific paper, it discusses the algorithm and not the specific implementation we have made. As with the documentation of PADENA, we did not think to list the platforms and technologies we did not use.

Still, you may find the paper useful because the parallelized algorithm steps described there might also be parallelized on other platforms if you wish, such as HPC or the cloud. I would expect there to be considerable programming effort in doing so, though.
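
As a rough illustration of the shape such a port tends to take, the Python sketch below splits a set of reads into chunks, counts k-mers in each chunk independently, and merges the partial counts. K-mer counting is only an illustrative stand-in here, not a step taken from the PADENA paper, and the function names are hypothetical. A thread pool is used to keep the sketch portable; in CPython it will not speed up CPU-bound work, but the same chunk-then-merge structure applies to processes on one machine or to ranks on a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_kmers(reads, k):
    """Count every length-k substring in a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_into_chunks(items, n_chunks):
    """Split items into at most n_chunks contiguous, non-empty slices."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_kmers_chunked(reads, k, workers=4):
    """Count k-mers per chunk concurrently, then merge the partial counts."""
    chunks = split_into_chunks(reads, workers)
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is independent, so no worker needs another's data.
        for partial in pool.map(lambda chunk: count_kmers(chunk, k), chunks):
            total.update(partial)  # merge step: add partial counts together
    return total


reads = ["ACGTAC", "CGTACG", "GTACGT", "TTTT"]
merged = count_kmers_chunked(reads, k=3, workers=2)
print(merged == count_kmers(reads, 3))
```

The merge step is where a distributed version gets expensive: on a cluster the partial results have to cross the network, which is a large part of the programming effort mentioned above.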

I wish you all the best in your efforts,

Simon

 

Feb 3, 2012 at 5:09 AM
Edited Feb 3, 2012 at 5:09 AM

Thanks for the recommendation. I am using FileFormatConverter; it is an excellent tool (maybe the best I have come across) for parsing and converting large sequences.

regards,

@MontesLeonardo

Coordinator
Feb 3, 2012 at 3:20 PM

Looks like you fixed the error you listed in your other post - I'll delete it, but if you still find a problem please repost.

Feb 3, 2012 at 9:01 PM
Not really, Simon. My other post was about another tool, FileFormatConverter. That is the post you have deleted.

Thanks for your responses,

greetings


--
Leonardo Montes Marín
@MontesLeonardo