Getting Cooking Again

Jul 28, 2014 at 1:53 PM
There are a number of additions to the cookbook that would make sense, but again we are looking to the community for guidance on what to work on. We have now recruited a plethora of students for the new semester, and we are working through the library with them to help people explore it.

Here are a few current thoughts on priorities for the cookbook:
  • How do I generate synthetic sequence data?
  • How do I extract an indexed set of kmers from a genome or genome fragment?
  • How do I use the kmer set to calculate a distance or similarity measure between two sequences?
  • How do I rapidly align two sequences using mummer?
  • How do I parse a BAM/BED/SAM/??? file and make sense of what comes back?
    [this is easy for the programmer, but needs biologist input, so any volunteers are welcome]
  • How do I perform a multiple alignment using the Clustal Web Service?
This is a quick starting list. We need about 20 or so, so please keep them coming.
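
To make the first two questions concrete, here is a rough sketch in plain Python rather than against the library itself. The function names (random_dna, kmer_index) and the gc_content parameter are made up for illustration, and the synthetic data is just independent draws per base, which is the simplest possible model and probably not the only one we want.

```python
import random
from collections import defaultdict


def random_dna(length, gc_content=0.5, seed=None):
    """Return a random DNA string with roughly the requested GC fraction.

    Assumption: each base is drawn independently (no Markov structure).
    """
    rng = random.Random(seed)
    at = (1.0 - gc_content) / 2.0
    gc = gc_content / 2.0
    # Weights are given in the same order as the alphabet "ACGT".
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def kmer_index(sequence, k):
    """Map every k-mer in the sequence to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return dict(index)


if __name__ == "__main__":
    seq = random_dna(60, gc_content=0.4, seed=1)
    print(seq)
    print(kmer_index(seq, 3))
```

Keeping the start positions, rather than only the counts, means the same index can serve both the distance measures and any seed-based alignment recipe.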

Jul 29, 2014 at 1:31 AM
Hi Jim

I will take a look at:

How do I generate synthetic sequence data?
How do I rapidly align two sequences using mummer?

Jul 30, 2014 at 5:06 AM
Additional thoughts:

How do I extract a sequence fragment of length N from a genome upstream from a specific gene?
This is a simple enough task with a few wrinkles. The assumption is that we are looking for regulatory regions or other features in the non-coding intergenic regions. We have some method available to parse a genbank file (see recipe #10) and to access the metadata (#11). The idea is then:
  • Take a specific gene id and the size of the region required
  • Use the gene metadata from the gbk file to find its strand and start and end points
  • Use this info to extract a subsequence from the whole genome sequence object
Simple, but with minor complications in making sure we have the right strand and the right direction. Let's not consider operons too carefully at this stage.
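
As a rough sketch of the strand handling described above (plain Python, not the library API): upstream_region is a hypothetical helper, and it assumes 0-based, half-open gene coordinates on the forward strand, a strand value of 1 or -1, and a linear genome, so the region is truncated at the sequence ends rather than wrapped around.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, n):
    """Return up to n bases upstream of a gene, read 5' to 3' on the gene's strand.

    Assumptions: start/end are 0-based, half-open forward-strand coordinates;
    strand is 1 (forward) or -1 (reverse); the genome is treated as linear.
    """
    if strand == 1:
        # Forward strand: upstream lies before the start coordinate.
        return genome[max(0, start - n):start]
    if strand == -1:
        # Reverse strand: the gene's 5' end is at `end`, so upstream lies
        # after it in forward coordinates and must be reverse-complemented.
        return reverse_complement(genome[end:end + n])
    raise ValueError("strand must be 1 or -1")


if __name__ == "__main__":
    genome = "AACCGGTTAC"
    print(upstream_region(genome, 4, 6, 1, 3))
    print(upstream_region(genome, 4, 6, -1, 3))
```

A real recipe would take the start, end and strand from the parsed gbk metadata; GenBank locations are 1-based and inclusive, so they need converting before being passed to a helper like this.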

D2 Style Distance Measures:
For clarity on the distance measures mentioned above, we want D2, D2* and D2S, as described in the early sections of this paper from the Waterman group. Note that while we will put this in the cookbook to begin with, we should also look to incorporate it into the core library as a set of distance or similarity measures. Thoughts are welcome on how best to manage and locate this in the hierarchy.
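
For reference, the plain D2 statistic is the number of k-mer matches between two sequences, i.e. the inner product of their k-mer count vectors. A minimal sketch in plain Python follows (kmer_counts and d2 are illustrative names, not library functions). D2* and D2S additionally centre the counts using a background model of k-mer probabilities, so they are left out of this sketch.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def d2(seq_x, seq_y, k):
    """D2 statistic: sum over all k-mers w of count_x(w) * count_y(w)."""
    counts_x = kmer_counts(seq_x, k)
    counts_y = kmer_counts(seq_y, k)
    # A Counter returns 0 for k-mers it has not seen, so only the
    # k-mers present in seq_x need to be visited.
    return sum(n * counts_y[w] for w, n in counts_x.items())


if __name__ == "__main__":
    print(d2("ACGTACGT", "ACGT", 4))
```

Note that D2 is a similarity score, not a distance: larger values mean more shared k-mers, and the raw value grows with sequence length.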
Aug 15, 2014 at 9:06 AM
Edited Aug 15, 2014 at 9:06 AM
A new suggestion for the cookbook:

Let's say I have a sequence object. How do I quickly add and annotate a feature onto the sequence (let's say a putative promoter element 1456 to 1578) and export this as a genbank file?

The genbank file should contain something like this:
promoter 1456..1578
               /label="putative promoter sequence"
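
To pin down the target output, here is a small plain-Python sketch that only formats the feature table entry as text; it is not the library's export code, and format_feature is a hypothetical helper. It follows the GenBank flat-file layout, where the feature key starts in column 6 and the location and qualifiers start in column 22.

```python
def format_feature(key, start, end, qualifiers):
    """Format one GenBank feature table entry as text.

    Assumptions: start/end are 1-based inclusive positions on the forward
    strand, the key is at most 15 characters, and qualifier values are
    short enough not to need line wrapping.
    """
    # 5 spaces + key padded to 16 characters puts the location in column 22.
    lines = ["     " + key.ljust(16) + f"{start}..{end}"]
    for name, value in qualifiers.items():
        # Qualifiers sit on their own lines, also starting in column 22.
        lines.append(" " * 21 + f'/{name}="{value}"')
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_feature("promoter", 1456, 1578,
                         {"label": "putative promoter sequence"}))
```

The actual recipe would attach the feature to the sequence object's metadata and let the library's genbank formatter write it out; this just shows the text we would expect to see.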
Sep 3, 2014 at 1:56 PM
Looked at this the other day and noted that we don't have an entry for saving genbank files at this point. So let's try to kill two birds with one stone. Any takers?