This project is read-only.

Reads, real and artificial, testing of assemblers

May 21, 2013 at 3:32 PM
hi all
we are trying to determine a reasonable test set for testing padena. We have some other computational interest in shattering completed genomes at some distribution of read lengths around a known mean for each technology, and thence generating randomised read tilings and low level noise in the sequences generated.

So Q1: Anyone done anything like this in C# and or .NET Bio already?

Q2: (related) anyone know a good summary of the error rates and read length distributions for NGS? Metzker (Nat Rev Gen Jan 2010) for example has lots of good basic data, but no variance for read lengths and no actual error rates as far as I can see. Anyone?

This can be used as one set of tests for padena, but I was wondering what people thought would be bare minimum of tests we would want to consider. The test coverage problem suggests a number of parameters:
  • technology and thus read length
  • euk vs arch vs prok
  • sensitivity to error rate
Q2: anyone more familiar with the assembler literature seen a standard set of data that people use? I've not done a major search but this seems to vary a fair bit.

We will happily share code as we go anyway, but any comments on a suitable basis set would be good, as would any code that others have used as well.

There are of course real sequencing project reads lodged for various projects. I know particularly of some for staph. other pointers or suggestions welcome.

cheers
jh
May 21, 2013 at 4:42 PM
Hi james

In the past i made a lot of tests with padena, using a small cluster with 32 GB RAM, in Windows Server 2008 R2, with HPC Pack, using padena version of MBF 2.0 and .NET Bio 1.0

Q1: Now i'm working in another stuff related with .NET Bio and C#, but i'm interested in work also in this kind of topics, i'm C# programmer. about share the code, it will be interesting.

Q2: About the assembler literature i have many articles related with that, even i was working in a small project consists, in make a benchmarking of diferent assembler tools in Windows HPC Cluster. In the next articles you will find important information about, data, parameters, benchmarks, etc.

GAGE: A critical evaluation of genome assemblies and assembly algorithms. http://www.ncbi.nlm.nih.gov/pubmed/22147368

A Practical Comparison of De Novo Genome Assembly Software Tools for Next-Generation Sequencing Technologies http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0017915

the both articles are free access, so I hope you enjoy this, and I will be happy to retake this issue and work together and maybe retake my intentions of publish about this, together.

Cheers,

Leo
May 21, 2013 at 4:48 PM
Edited May 21, 2013 at 4:49 PM
Hi Jim,

Q1/Q2

For simulating reads from a reference base, I think there are two issues (for the simple non-paired end case). Deciding where a read “start” on the reference genome, and then introducing errors in to that read given the sequence it is derived from.

Assuming a read starts at some position and is of some known length, most people seem to use a uniform error rate across the reads, and a uniform substitution probability for each base. These assumptions are wrong of course. I went looking for more sophisticated software before to avoid training my own error model with known data and I found this package which seemed useful, but I wound up just training on real data instead, might be good to look at though. http://www.niehs.nih.gov/research/resources/software/biostatistics/art/ . I think someone at the Broad was also considering doing a better simulator, last time I was thinking about this, can send them an email to ask.

As for deciding coverage from a known genome, the naïve assumption is that reads start in a poisson distributed fashion. In actuality, the read coverage is very over-dispersed. As a first approximation, one can simulate read start locations from a negative binomial with variance 2-20X higher than the mean I think. In reality, coverage also co-varies strongly with things like regional GC %, but this complexity can vary across replicates and is probably not worth simulating. The depth distribution from some early sequencing data of a bacterial strain is shown attached, this is much better technically now but still pretty nasty.

For read length, this largest determinant here is whether quality trimming is done. For Illumina sequencing, virtually all usable reads are of the length produced by the sequencer, but as the end of reads are often of lower quality, they are often trimmed by various criteria before assembly to avoid problems due to errors, this introduces a lot of variance.

For test parameters:

Technology: Focusing solely on Illumina seems the best choice.

Domain: Eukaryotic genomes are too large, and are locally well assembled by the GATK haplotype caller. PADENA seems most useful in the >4 kb <6 MB domain.
Sensitivity to Error Rate: Not sure what is best here.

Test Data: Don’t know of any standards, would be happy to provide E. coli data (quite even coverage) or M. extorquens data (the genome has higher GC and so more over-dispersed coverage). A PhiX genome is sequenced as part of every illumina control lane, as a minimum we could ensure these are always assembled.

Cheers,
Nigel
May 30, 2013 at 3:25 AM
Hi Nigel, leonardo,
Thanks for the great replies.

Nigel - I have been talking with Tim just now about the simulation approach, and we were looking at contributing a read simulator to .NET Bio. The building blocks are there as part of the library, and it would be a nice application. However, the ART simulator comes with everything, and has windows command line binaries, so I think we will work with that in the first instance, and then see whether there is a gap in the market... We also had a look at wgsim: https://github.com/lh3/wgsim - some old fashioned C code, and also at a very good looking package for metagenomics called metasim http://ab.inf.uni-tuebingen.de/software/metasim/welcome.html. Obviously the second one is not relevant here, but it does look great.

So, we will initially work with ART and if possible with your E. coli data as that seems to fit the guidelines pretty well. When you assembled, what tool did you use? and do you recall or can you regenerate the appropriate command line options for comparison? [It would be good to run them side by side and see what the execution times are just for the sake of completeness.]

Anyone got any other thoughts on this?

Cheers
jh
May 30, 2013 at 3:15 PM
Instead of creating a new read simulator, you could always extend the one that ships with .NET Bio - although it is very basic; given that there are better ones out there though, it may be simpler (especially if you just need it for testing) to use one that is already fully-featured.

Simon
May 30, 2013 at 3:45 PM
Hi all

I'm totally agree with Simon, .NET Bio has a read simulator, i'm not sure about the "quality of the reads" in a biological sense, but i think is a good solution for generate data, also has a visual interface in contradistiction of ART, in that way is easier than another console aplications for that, you can find this tool when you make a complete instalation of .NET BIO in the next directory : C:\Program Files (x86).NET Bio\1.01\Tools\Bin\ReadSimulator.exe.

I think is a better approach if we focus on get better the existing readSimulator Util.

James if you need more sample data, i think this is a perfect test data from E. coli commensal strain K-12, sequenced in Ilumina with paired end data, http://download.clcbio.com/testdata/raw_data/solexa.zip.

Leo
Jun 11, 2013 at 7:50 PM
Edited Jun 13, 2013 at 3:23 AM
Hi Jim,

Some E. coli data from a resequencing study is available here, but let me know if you need more! : http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/REL8593A.fastq.gz

For comparisons, when testing I was running alongside velvet and the GATK haplotype caller to make sure I got the same answers. Although most of the open source assemblers are all *nix, you can save yourself a lot of virtual machine hassle by comparing to the velvet assembler found here that have been ported to windows:

http://www.applied-maths.com/download/open-source

I found it worked quite well. Some example powershell command lines I used are as below:

$ToAssemble="Deletion_3_Circle_1.fa"
D:\Programs\Velvet_1.1.04_executables\win64\singlethread\velveth_st_x64.exe temp 21 -fasta -short Deletion_3_1.fa
D:\Programs\Velvet_1.1.04_executables\win64\singlethread\velvetg_st_x64.exe temp -cov_cutoff 4
$PADENA=D:\BioTFS\bio\Build\Binaries\Release\PadenaUtil.exe
&$PADENA Assemble -o:PadenaAssembly $ToAssemble
PadenaUtil.exe Assemble -i -o:"Test.fna" -k:21 -v -c:10 "Deletion_3_Circle_1.fa"


For diagnosing problems, I borrowed a bunch of classes from the NodeXL guys to output Graphml files of the Debruijn graphs, which I have been visualizing in Gephi, along with various histograms of node coverages and splits. These visualization classes sit on top of padena, and output the graph after condensing all linear changes into contigs, which has been useful, if anyone else would like the code I can send on.

How have you found Art so far?

Cheers,
Nigel