How do I extract a sequence fragment of length N from a genome upstream from a specific gene?
This is a simple enough task with a few wrinkles. The assumption is that we are looking for regulatory regions or other features in the non-coding intergenic regions. We have some method available to parse a GenBank file (see recipe #10) and to access the metadata (#11). The idea is then:
- Take a specific gene id and the size of the region required
- Use the gene metadata from the gbk file to find its strand and start and end points
- Use this info to extract a subsequence from the whole genome sequence object
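Simple, but with minor complications in making sure we have the right strand and the right direction: for a gene on the forward strand, upstream means the N bases before its start coordinate; for a gene on the reverse strand, it means the N bases after its end coordinate, reverse-complemented. Let's not consider operons too carefully at this stage.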
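A minimal sketch of the extraction step in plain Python, to pin down the strand handling. It assumes the genome is already available as a string and that the gene's start, end and strand have been pulled from the gbk metadata; the function names, the 0-based half-open coordinates and the +1/-1 strand encoding are assumptions made for illustration, not the cookbook's API. It also treats the genome as linear, so the region is truncated at the sequence ends rather than wrapped around.

```python
# Complement table for unambiguous DNA bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def upstream_region(genome, start, end, strand, n):
    """Return up to n bases upstream of a gene, read 5'->3' on the gene's strand.

    genome : forward-strand sequence as a str
    start, end : gene location on the forward strand, 0-based, half-open
                 (hypothetical convention; adjust for 1-based GenBank positions)
    strand : +1 for forward, -1 for reverse
    n : requested length of the upstream fragment
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if strand == 1:
        # Upstream lies before the start coordinate; clip at the sequence start.
        lo = max(0, start - n)
        return genome[lo:start]
    if strand == -1:
        # Upstream lies after the end coordinate on the forward strand,
        # so take that slice and reverse-complement it.
        hi = min(len(genome), end + n)
        return reverse_complement(genome[end:hi])
    raise ValueError("strand must be +1 or -1")


genome = "ACGTTGCAAT"
forward = upstream_region(genome, 3, 6, 1, 2)   # bases 1..2 on the forward strand
reverse = upstream_region(genome, 3, 6, -1, 3)  # bases 6..8, reverse-complemented
```

For a circular genome the clipping would need to be replaced by wrap-around indexing, and a neighbouring gene may overlap the returned fragment; neither case is handled here.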
D2 Style Distance Measures:
For clarity on the distance measures mentioned above, we want D2, D2* and D2S as described in the early sections of this paper from the Waterman group. Note that while we will put this in the cookbook to begin with, we should also look to incorporate it into the core library as a set of distance or similarity measures. Thoughts welcome on how best to manage and locate this in the hierarchy.
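As a starting point for discussion, here is a rough sketch of the three measures in plain Python. It follows the definitions as they are commonly stated in the alignment-free comparison literature: D2 is the inner product of the k-mer count vectors, while D2* and D2S use counts centred by their expectation under a background model. The function names, the i.i.d. background with uniform base probabilities as the default, and the choice to skip k-mers containing letters outside the alphabet in the centred measures are all assumptions made here, and should be checked against the paper before this goes anywhere near the core library.

```python
from collections import Counter
from itertools import product
from math import prod, sqrt

# Assumed default background: independent, equally likely bases.
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(x, y, k):
    """D2: sum over words w of X_w * Y_w."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    return sum(c * cy[w] for w, c in cx.items())


def _centred(x, y, k, probs):
    """Yield (X_w - n*p_w, Y_w - m*p_w, p_w) for every word over the alphabet,
    where n and m are the numbers of k-mer positions in x and y."""
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    n_bar, m_bar = len(x) - k + 1, len(y) - k + 1
    for letters in product(sorted(probs), repeat=k):
        w = "".join(letters)
        p_w = prod(probs[c] for c in letters)  # i.i.d. word probability
        yield cx[w] - n_bar * p_w, cy[w] - m_bar * p_w, p_w


def d2_star(x, y, k, probs=UNIFORM):
    """D2*: centred products scaled by sqrt(n*m) * p_w."""
    scale = sqrt((len(x) - k + 1) * (len(y) - k + 1))
    return sum(xt * yt / (scale * p) for xt, yt, p in _centred(x, y, k, probs))


def d2_s(x, y, k, probs=UNIFORM):
    """D2S: centred products scaled by sqrt(Xc^2 + Yc^2); zero terms are skipped."""
    total = 0.0
    for xt, yt, _ in _centred(x, y, k, probs):
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:
            total += xt * yt / denom
    return total


score = d2("ACGT", "ACGA", 2)  # shared 2-mers: AC and CG
```

Enumerating every word costs 4^k iterations, which is fine for small k but would need a sparse formulation for longer words. All three are similarity scores rather than distances, which bears on where they should sit in the hierarchy.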