Genome Browser for .NET Bio: Sequencing Reads

Coordinator
Oct 24, 2011 at 11:18 PM

I am interested in using a general-purpose Genome Browser to quickly view my BAM files. The key is that I need to zoom out to chromosome view and then zoom into reads. I need to indicate SNPs and defects and show gene annotations. Sequence Assembler works well at the reads level, DNA Sequence assembler does a good job of annotation, and SilverGene does a good job of zooming, but I need all three in one application.

I would like to work with other people who have the same interest and hash out a minimal spec. Then I assume we would review the code from all three tools and decide how to organize the effort.

Add to this discussion if you are interested, or tell me if you have a better idea along these lines.

Oct 25, 2011 at 1:43 PM
Edited Oct 25, 2011 at 1:43 PM

Hi khaden,

Have you looked at existing programs such as BAMView (http://bamview.sourceforge.net/), IGV (http://www.broadinstitute.org/igv/), and MagicViewer (http://bioinformatics.zj.cn/magicviewer/)? They tackle the semantic viewing problem, as well as the annotation integration you mentioned. From these, a spec can be hashed out that replicates the best features of each while adding other features as well.

Vince

 

Coordinator
Oct 25, 2011 at 4:23 PM

Hey Vince,

Thanks for the links. I reviewed those apps and these are the features I think are useful. We can certainly build a spec from them. Unfortunately, I don't think these are code bases we can use because they are all Java based. I am looking for a code base that can become a starting point for .NET BIO developers to add their application specific widgets. In effect, I am looking for a framework that each developer can add visualization according to their interests/objectives. There are so many genome browsers available, that I do not believe one GB can fit all applications. Additionally, application developers need a generic viewer to include in their app to round out the features for a complete solution. Specifically, the Genome Browser is not necessarily the main focus, but it is part of the application to help view the data when required.

Additionally, I am also trying to gauge the interest for customizable visualization and the utility of a GB component.

Kirt

Oct 25, 2011 at 4:47 PM

Your idea sounds neat, but it is still unclear to me what the specific aim is. I'll try to get at it by asking some more questions and commenting on your last post:

"I am looking for a code base that can become a starting point for .NET BIO developers to add their application specific widgets."

Do you mean creating a spec/API on how dataviz components should interact with each other?

"In effect, I am looking for a framework that each developer can add visualization according to their interests/objectives."

Does this mean some kind of layer above .NET Bio that can represent each object (Sequence, etc.) using various visual means (rectangles like UCSC tracks, full sequence, etc.)? Something like using the .NET Drawing namespace and tailoring it towards drawing .NET Bio objects.

"Additionally, application developers need a generic viewer to include in their app to round out the features for a complete solution."

Like charts, and scatterplots? Maybe Sho can be leveraged for this?

"Specifically, the Genome Browser is not necessarily the main focus, but it is part of the application to help view the data when required."

If a system is developed to visually represent .NET Bio objects, then a Genome Browser is a good logical step forward.

Vince

Oct 25, 2011 at 8:14 PM
Edited Oct 25, 2011 at 8:15 PM

It sounds to me as if khaden wants to implement a separate GB specialized to this library as the root data model.  I don't think he is looking for design details such as how components should interact just yet.  He appears to just be asking for input to the requirements for such a component and how much interest is there in this effort.  While vforget may be a little ahead of himself by discussing architecture so soon, I think he is on the right track if we are to look down the road.  What I think vforget is trying to say is we should add these requirements:

  1. There should be a visualization layer that plugs into the .NET Bio libraries as the default data hierarchy.
  2. The data visualization components must be able to interact.
  3. Visualization should include the ability to view charts and plots as well as be a full-service Genome Browser.

Did I capture your ideas correctly vforget?

Personally, I don't see how the .NET Bio library can survive without a visualization mechanism so I applaud khaden's initiative.

Marcus

Coordinator
Oct 27, 2011 at 9:02 PM

I agree - a general GB that interacts well with .NET Bio is really needed. Unfortunately, I don't know of any code that would be useful here as a template or starting point. We use IGV often at CBSU, and I like its functionality.

Jarek

Coordinator
Oct 27, 2011 at 11:54 PM

Although .NET Bio does not have its own visualization mechanism, it can be integrated with a range of existing ones:

  • The Excel add-in permits usage of all the graphing and related features in Excel, and the add-in itself demonstrates how to link to other Office apps.
  • The Excel add-in also uses the NodeXL graph visualization system, also available on CodePlex, to draw the proportional Venn diagrams which visualize shared and unique genomic segments for comparative genomics (and NodeXL can do much more)
  • The Genome Assembler demo app shows how Windows Presentation Foundation and Silverlight can be used to visualize genomic data, and the Silverlight apps on the QUT.Bio CodePlex site go much further
  • Integration of .NET Bio and Sho (http://research.microsoft.com/en-us/projects/sho/) is simple and featured in our training courses (course materials are on this site) and gives access to configurable visualizations in Python scripting
  • Finally, the BLIP application (BLAST in Pivot) uses .NET Bio and the Silverlight PivotViewer control to visualize genomic data - it is also a CodePlex project. Also take a look at http://www.microsoft.com/silverlight/pivotviewer/ for more on the viewer.

I admit - none of these is a genome browser and we could certainly use one, but collectively they go a long way towards providing a visualization mechanism on top of .NET Bio - and I haven't mentioned the wider range of bioinformatics tools that read any of the file formats .NET Bio can write, such as BAM, SAM and GFF.

I feel strongly that a genome browser built on top of .NET Bio would be a huge benefit, and in fact we have made attempts to meet this need using DeepZoom, but the prototype needs a lot more work. We are also exploring the possibility of a multitouch app using Metro and Windows 8. Whether or not these turn into reality, I very much support Kirt's idea of a 'browser toolkit' using .NET Bio consisting of a library of components - while there might be a default app built, the real value would be in a library of componentry that could be extended, repurposed and recomposed simply to build custom browsers suitable for different needs.

To throw some ideas out there for components:

  • An abstract track class
  • A visualization canvas that can support multiple tracks
  • A set of visualizations of data dependent on track type, such as a histogram, a character string, a line graph, line graph with error bars, text labels on regions, and embedded graphics
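
The component ideas above (an abstract track class, a canvas holding several tracks, and visualizations that depend on track type) can be sketched roughly as follows. This is only an illustration in Python, not .NET Bio code: the names `Region`, `Track`, `SequenceTrack`, `HistogramTrack` and `Canvas` are hypothetical, and text output stands in for real drawing.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic window: chromosome name plus 1-based inclusive coordinates."""
    chromosome: str
    start: int
    end: int


class Track(ABC):
    """Hypothetical abstract track: one named row of data on a shared scale."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def render(self, region, width):
        """Return a text rendering of this track, `width` characters wide."""


class SequenceTrack(Track):
    """Character-string visualization: shows the bases that fall in the region."""

    def __init__(self, name, chromosome, sequence):
        super().__init__(name)
        self.chromosome = chromosome
        self.sequence = sequence

    def render(self, region, width):
        if region.chromosome != self.chromosome:
            return " " * width
        bases = self.sequence[region.start - 1:region.end]
        return bases[:width].ljust(width)


class HistogramTrack(Track):
    """Histogram visualization: one glyph per position, scaled to the window maximum."""

    GLYPHS = " .:|#"

    def __init__(self, name, chromosome, values):
        super().__init__(name)
        self.chromosome = chromosome
        self.values = values  # one non-negative value per base, index 0 = position 1

    def render(self, region, width):
        if region.chromosome != self.chromosome:
            return " " * width
        window = self.values[region.start - 1:region.end][:width]
        top = max(window, default=0) or 1  # avoid dividing by zero
        levels = len(self.GLYPHS) - 1
        bars = "".join(self.GLYPHS[round(v / top * levels)] for v in window)
        return bars.ljust(width)


class Canvas:
    """Container that renders every track against the same region."""

    def __init__(self):
        self.tracks = []

    def add(self, track):
        self.tracks.append(track)

    def render(self, region, width=20):
        return [f"{t.name:<10}{t.render(region, width)}" for t in self.tracks]
```

A new visualization type would then be one more `Track` subclass, and the canvas would not need to change.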

To throw out some challenges:

  • How best to navigate huge genomes?
  • How to represent relationships between related genomes, or visualize complex rearrangements such as neoplasias?
  • How to represent polyploidy?
  • How to visualize genes and regulatory elements implicated in a common pathway?
  • How to collectively visualize genes of similar function, regardless of physical location?
  • How to visualize metagenomic samples?

 

 

Oct 28, 2011 at 2:55 PM

I find sjmercer's suggestions to be excellent, in particular the component-based approach to development. I believe that .NET Bio would benefit from having a basic visualization layer, putting it on par with or above other existing Bio libraries. However, the challenges got me thinking. What are some use-case scenarios for this Genome Browser "framework"? Considering the existence of multiple popular genome browsers, what would be the distinctive feature of the .NET Bio GB (or its components)? I bring this up because the scientific community in general favors established tools over new ones UNLESS they offer something that drives their research into new directions. This is exemplified by the near total dominance of the BLAST algorithm despite there being numerous alternatives. IMHO, all currently available genome browsers are hindered by their reference-based approach to visualization. Moving away from this would allow us to visualize the challenges that sjmercer mentions. There are actually some very nice genomic data visualizations out there (http://circos.ca/, http://www.bcgsc.ca/platform/bioinfo/software/abyss-explorer) that may provide some inspiration.

 

 

 

Coordinator
Oct 28, 2011 at 4:19 PM

Great! Ok - I have been digesting this and rereading the comments multiple times. I do not have a complete vision, but these are the pieces I see so far:

  • The strategy would be to enable everything, implement some. I think the key is extensibility which provides the outlet for those people that need specialization in a GB
  • IMHO, the communication strategy needs to be a common event system (Observer pattern) that allows anything to fire an event and many objects to respond to events.
    • Alternatively, maybe this is simple enough for just data binding
  • The basis of the system would rely on a controller (chromosome, start, stop) or controllers driving tracks of information
  • Tracks - (see sjmercer) I see adding, removing, subtracting, combining tracks. They can be graphs, shapes or annotations (linked text) that can be "decorated" according to the "environment". Even a reference genome could be just another track, so that we break away from the common requirement of a reference genome, as mentioned by vforget
    • Tracks could have different "controllers" (chr, start, stop) to handle different genomic locations of different genomes (i.e. metagenomics)
  • Strong table to visuals interaction: Allow the user to filter and sort tables of interest that are synchronized to tracks. This allows users to find features using numerical techniques and then switch to browsing when the right location has been found
    • This would allow viewing genes of similar function (sjmercer) in a table by sorting on a "gene function" column and allowing for a multi-line select to show all the genes in the contained region
  • SHO - to allow configurable visualizations to allow the user to manipulate the graphs
  • Navigation of huge genomes
    • I would love to use Deepzoom, but I do not feel qualified. I really like the ideas in Chronozoom and I should dedicate some time to understand it. 
    • Alternatively, we could use the ideas in SilverGene that shows multiple tracks and the features in each track "come in focus" at different resolutions.
    • The other solution is to transition between "views" like genome view, chromosome view, chromosome bands, genes, exons, and finally bases
  • "Do I mean a specific API" - I guess I do mean a specific API in the form of delegates
  • "Like the .net Drawing space" - Yes, I think that is the correct model where pens, brushes, shapes, and colors would have visual genomic counterparts like Read, Gene, SNP, Annotation that respond to common style settings
  • Visualizing structural variation - I think it is important to show chromosomal events like insertions, deletions, translocations, inversions using a cartoon of a chromosome, potentially two if the event happens between two chromosomes. I want to give the Google maps big picture and zoom into details.
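
As a rough illustration of the controller and event ideas in the list above, here is a minimal Observer-pattern sketch in Python. It is not .NET Bio code: `Location`, `LocationController` and `FeatureTable` are hypothetical names, and plain callables stand in for the delegates or events a .NET implementation would use. The table shows the "table to visuals" link in one direction only, by narrowing its rows to whatever the controller currently displays.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    chromosome: str
    start: int
    stop: int


class LocationController:
    """Holds the current (chromosome, start, stop) and notifies subscribers."""

    def __init__(self, location):
        self._location = location
        self._listeners = []

    @property
    def location(self):
        return self._location

    def subscribe(self, listener):
        """Register a callable taking the new Location; returns an unsubscribe function."""
        self._listeners.append(listener)
        return lambda: self._listeners.remove(listener)

    def move_to(self, chromosome, start, stop):
        if start > stop:
            raise ValueError("start must not exceed stop")
        self._location = Location(chromosome, start, stop)
        # Iterate over a copy so listeners may unsubscribe while being notified.
        for listener in list(self._listeners):
            listener(self._location)

    def zoom(self, factor):
        """Scale the window around its midpoint; factor > 1 zooms out."""
        loc = self._location
        mid = (loc.start + loc.stop) // 2
        half = max(1, int((loc.stop - loc.start) * factor / 2))
        self.move_to(loc.chromosome, max(1, mid - half), mid + half)


class FeatureTable:
    """Table view whose visible rows follow the controller's window."""

    def __init__(self, rows):
        self.rows = rows      # tuples of (name, chromosome, start, stop)
        self.visible = []

    def on_location_changed(self, loc):
        # Keep every row that overlaps the displayed window.
        self.visible = [
            name for name, chrom, start, stop in self.rows
            if chrom == loc.chromosome and start <= loc.stop and stop >= loc.start
        ]
```

Tracks would subscribe in the same way as the table, so several views stay synchronized without knowing about each other.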

I know that Jeffrydotnet will tell me that I have jumped to design too fast and I haven't been able to cover all of sjmercer's requirements, but it shows my thinking so far.

I am hoping that readers will throw stones at my straw man, because from the ashes, comes a greater design.

Coordinator
Nov 1, 2011 at 8:13 PM

Sounds interesting - I suppose I see a browser as a 'container' for 'tracks' which display different types of data anchored to a common scale, then with some form of transition to a different visualization that permits visualization of features (perhaps based on a selection) grouped by some other facet of information such as a common function or pathway - really a visualization based on the facet of location is all a browser is today.

I like all of Kirt's ideas - and I'm wondering how best to turn this into action; I think the best place to start would be a concept document and/or storyboard of what the capabilities and user experience would be. I think the guiding philosophy should be to build a system that (as Kirt says) 'enables everything but implements some' - so we only need to be clear about the core and the extensibility model. The other general guideline I would like to see is copious input from current browser users - what does a browser need to do to be useful? Which browsers are the best right now? What features are particularly needed, and where are the pain points? If technology wasn't an issue, what would the perfect browser do?

Anyone have direct user input, or suggestions on the best way to get it?

Nov 4, 2011 at 1:34 AM
Edited Nov 4, 2011 at 1:45 AM

This sounds like an interesting project to be involved with and I would enjoy participating. As I understand the discussion so far, the initial goal is to supplement the .NET BIO API with a set of visualization components that will support a genomic browsing capability. Developers would then incorporate customized tracks representing location or range-specific annotations (e.g. SNPs, exons, etc.). This new visualization extension would be based on one or more visualization technologies, such as SHO or Chronozoom, that would allow for a rapid change in genomic resolution. As Vince noted earlier, the traditional linear representation used in most genome browsers could be extended by using something like Circos that would allow for visualizing data across multiple chromosomes concurrently. While Circos diagrams find extensive use as a means of displaying genome-wide somatic mutations in a particular tumor, they could also be useful in displaying other multiple chromosome data such as a regulatory network.

Since the completion of the Human Genome Project, there has been an explosion in information regarding non-coding transcripts particularly with regards to siRNAs and miRNAs as well as increasingly complex models of transcription (e.g. alternative splicing, RNA editing, alternative transcription start sites, etc.).  A well-received paper authored by the ENCODE team suggested moving away from a protein centric model to a model that supports the extensive transcription activity of the genome (1). One example of transcription complexity is the tumor suppressor gene, PTEN, found at 10q23. Even a small loss of PTEN activity due to RNA interference has been implicated in a number of cancers. PTENP1 is a pseudo-gene located at 9p21 with extensive sequence homology with PTEN and its transcript is now thought to serve as a decoy for miRNAs targeting PTEN transcripts, thus rescuing PTEN activity (2). The point of this example is to demonstrate that one requirement for a new genome browser should be the ability to display complex and possibly variable transcription phenomena including those that involve multiple loci.

While it's too early in the project's design process to focus on technology choices, Microsoft's Reactive Extensions API (http://msdn.microsoft.com/en-us/data/gg577609) deserves some consideration for providing an event handling framework.

Fred

(1) Genome Research 17:669-681 (2007)

(2) Science Signaling 3:1-3 (2010)

Nov 4, 2011 at 7:04 PM
Edited Nov 4, 2011 at 7:25 PM

To reply to Simon's questions (preface my answers by saying that I am a heavy user of the UCSC Genome Browser):

"what does a browser need to do to be useful?"

A lot of this deals with being able to add and analyze your own data. Something that scientists do often :)

- load your own data (e.g. custom tracks)

- produce publication quality images (or perish!)

- alignment tools (e.g. BLAT).

- mining the underlying data (e.g. UCSC Table Browser).

"Which browsers are the best right now?"

Web-based: UCSC, Ensembl, GBrowse.

Standalone: IGV (I am less familiar with these).

"What features are particularly needed, and where are the pain points?"

This is related to the first question, but I'll try to be more specific:

- track-reordering by dragging ... before this was a pain.

- track grouping

- mouse over-events to give details.

I see most pain points being UI issues. If we get some experienced UI designers on this we may have something neat!

"If technology wasn't an issue, what would the perfect browser do?"

Offer full and fluid semantic zooming with full customization.  Like a fully customizable Deep Zoom image. 

Touch-based gestures.
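
One simple way to approximate semantic zooming is to pick the view from the number of bases drawn per screen pixel, using the sequence of views proposed earlier in the thread (genome, chromosome, chromosome bands, genes, exons, bases). The sketch below is in Python and purely illustrative; the function name `choose_view` and every threshold value are assumptions made for the example, not figures from any existing browser.

```python
# Hypothetical thresholds, in bases shown per screen pixel, ordered coarse to fine.
VIEW_LEVELS = [
    (1_000_000, "genome"),
    (100_000, "chromosome"),
    (10_000, "chromosome bands"),
    (100, "genes"),
    (1, "exons"),
]


def choose_view(start, stop, width_px):
    """Pick the view for the window [start, stop] drawn width_px pixels wide."""
    if width_px <= 0 or stop < start:
        raise ValueError("need a positive width and start <= stop")
    bases_per_pixel = (stop - start + 1) / width_px
    for threshold, view in VIEW_LEVELS:
        if bases_per_pixel >= threshold:
            return view
    return "bases"  # less than one base per pixel: individual letters fit
```

A fluid implementation would blend between neighboring views rather than switch abruptly, but the lookup captures the basic idea.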

"Anyone have direct user input, or suggestions on the best way to get it?":

One easy way is to look at browsers that integrate features that users request. I know that the UCSC Genome Browser staff considers comments provided by users through their mailing list. If enough users ask for something then it gets implemented (if possible). So, new features like track-reordering may be the result of user input.

-------------------

Just an idea, but something similar to this http://rstudio.org/ but for genomic data might be nice. Kinda like a desktop version of Galaxy (http://main.g2.bx.psu.edu/).

I'm going to shoot this out there and see if it sticks, but what is the target audience? Is it researchers analyzing publicly available data (such as the human genome), or researchers analyzing de novo projects? The answer to this question will help determine the ultimate platform we intend to target (web/desktop/mobile), which may feed back into requirements.

Coordinator
Nov 7, 2011 at 4:21 PM

All,

I have read and re-read these posts. Thanks. I have thought about the question "Is it researchers analyzing publicly available data or researchers analyzing de novo projects" and I have reviewed Matlab, RStudio, Galaxy, Trident, and Microsoft's Reactive Extensions API, Circos, Pivot Viewer, Abyss, Chrono zoom, etc.

I think what I am proposing is a specialized Genomic Browsing library (GB lib) that fits in the .NET Bio library to allow simple viewing of genomic data. I want it to be simple for the developer to utilize, simple for the user to navigate, and simple to customize. Just as Venn diagrams, heat maps, Circos plots, etc. are specialized graphs, controls, and metaphors created to deal with specific types of information, this project would provide the metaphor for genomic navigation as a library.

This allows us to slice through the questions easily. Scripting should be enabled by other libraries that would use the GB API. Who is the audience? Researcher or casual user? Well, both, because GB lib is on a more generic level. We leave this decision up to the application developer, but give him the tools to do what is necessary. The same answer applies to Web/Desktop/Mobile. We will not preclude any platform and thereby enable all. We want a very rich event mechanism, and I think Microsoft's Reactive Extensions API would provide it, but it is early for that decision. I guess the target audience is the application builder.

The spec for this project would talk about different applications, all supported by the GB library. The library would be driven by the needs of the applications. It is important that the library be driven by "expected" features or features that allow intuitively simple navigation. The navigation suggested by vforget would provide the inspiration. Additionally, the application-specific feature mentioned by fcriscuolo for miRNA would validate that we are developing something that is useful and flexible for all types of genomic viewing. We might even have to split the work into app and library development.

I would like to reduce this discussion down to requirements, technologies, and design decisions. We need a design repository: somewhere we can share class diagrams, architecture diagrams, and similar design documents that can be updated by all. We need to collect, extract, and organize the thoughts expressed here and mature them. I feel this is the next step so that we can get group consensus. If anyone has suggestions, please provide them.

Also, if you think this separation between application and GB library is not the correct direction, please provide feedback. I/we have bitten off a big bite, but I don't expect one person to chew it all on their own.

Coordinator
Nov 7, 2011 at 5:29 PM

Vince, I'm not sure who the intended audience is exactly. I know what I would like to do with a browser, but I don't have any reason to believe I am a typical user in that respect. This, IMHO, is a further argument for a component-based approach - if we do this correctly, it would be possible to construct applications that meet the needs of different groups from a common base.

I would add one thing to Kirt's requests - it would be great to get something working end-to-end early on, so the first components selected for implementation should support a clear and simple use case. Any suggestions for what it could be?

Nov 16, 2011 at 3:20 PM

Considering we are taking a component-based approach we can divide the largest components into two parts: 1) the UI components that make up the browser and 2) the components that make up the visualization. Do we want to tackle both of these at the same time? 

A simple usage case for just the 2nd point (components that make up the visualization) would be:

Generate a genome image (figure for manuscript) that contains features such as genes (blocks) and expression levels (graph), derived from data stored in Excel (or tab-delimited plain text).

Surprisingly, there are not many standalone tools that do this and most are either web-based or linux-based or require some work to install:

Examples: 

http://wishart.biology.ualberta.ca/cgview/application.html

http://circos.ca/

http://genome.crg.es/software/gfftools/GFF2PS.html

 

Core components would be a reference, track, and feature component:

reference: name reference genome, chromosomes and their sizes. Sequence of genome?

track (member of reference): type (blocks or graph), color, visibility (dense, pack, full ala UCSC).

feature (member of track): chromosome, start coord, end coord, color (if set, overrides the track color)
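
The three core components listed above map naturally onto a small data model. The following Python sketch is one possible reading of that list, not an agreed design: the class and field names are hypothetical, the optional genome sequence is left out, and the coordinate check assumes 1-based inclusive positions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Feature:
    chromosome: str
    start: int
    end: int
    color: Optional[str] = None   # if set, overrides the track color


@dataclass
class Track:
    name: str
    kind: str = "blocks"          # "blocks" or "graph"
    color: str = "black"
    visibility: str = "full"      # "dense", "pack" or "full", as in UCSC
    features: list = field(default_factory=list)

    def color_of(self, feature):
        """Feature color wins when present; otherwise fall back to the track color."""
        return feature.color or self.color


@dataclass
class Reference:
    name: str
    chromosome_sizes: dict        # chromosome name -> length in bases
    tracks: list = field(default_factory=list)

    def add_feature(self, track, feature):
        """Attach a feature to a track after checking it fits the reference."""
        size = self.chromosome_sizes.get(feature.chromosome)
        if size is None:
            raise ValueError(f"unknown chromosome {feature.chromosome!r}")
        if not 1 <= feature.start <= feature.end <= size:
            raise ValueError("feature coordinates fall outside the chromosome")
        if track not in self.tracks:
            self.tracks.append(track)
        track.features.append(feature)
```

Validating features against the reference at insertion time keeps later drawing code free of range checks.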

Features we can provide that are not present in existing tools:

- Excel integration.

- Conversion of output images to multiple formats (PNG, PDF, SVG, visio?).

- Direct editing of features via the UI.

- Export/link to MS word (e.g. provide functionality to insert a figure caption that is compatible with formatting in MS Word).

- Ease of installation and setup (install wizard, genome setup dialogs, etc).

- Toggle between linear and circular views.

Basically, build components necessary to facilitate the generation of genome figures for publication.
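
To make the figure-generation use case concrete, here is a small Python sketch that turns tab-delimited rows into an SVG image using only the standard library. The column layout (name, start, end, color) and the function name `features_to_svg` are assumptions made for this example, and only block features are drawn; graphs, labels and the other output formats mentioned above are left out.

```python
import csv
import io
from xml.sax.saxutils import escape


def features_to_svg(tsv_text, region_start, region_end, width=600, row_height=20):
    """Draw tab-delimited rows of (name, start, end, color) as SVG rectangles."""
    scale = width / (region_end - region_start + 1)   # pixels per base
    shapes = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue                                  # skip blank lines
        name, start, end, color = row
        start, end = int(start), int(end)
        x = (start - region_start) * scale
        w = (end - start + 1) * scale
        y = len(shapes) * row_height                  # one feature per row
        fill = escape(color, {'"': "&quot;"})
        shapes.append(
            f'<rect x="{x:.1f}" y="{y}" width="{w:.1f}" height="{row_height - 4}" '
            f'fill="{fill}"><title>{escape(name)}</title></rect>'
        )
    height = max(len(shapes), 1) * row_height
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(shapes)
        + "</svg>"
    )
```

Because SVG is a vector format, the same output could later be converted to PNG or PDF by an external tool.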

This is somewhat of a departure from a "genome browser", but some of these features may be simple enough to add to the core components (track, features, etc.) and would be useful in building a genome browser.

Vince

 

Nov 18, 2011 at 9:18 PM

I have started development of a DAS1 client API for use by .Net Bio developers. My assumption is that such an API might be of use to the genome browser project as a means of accessing data from the large number of DAS1 servers. The API will follow the design of similar Java-based DAS clients such as the dasobert project and will support the DAS 1.6 specification. While the new DAS/2 specification does provide some enhancements, there are only a handful of DAS/2 servers online compared to over 1800 DAS1 servers, and DAS/2 servers are not currently supported within the DAS registry. I intend to make the client API available via a CodePlex project as soon as I can meet the 30 day release deadline. The alpha release will not have support for DAS registry operations, but these will be supported by the beta release. I plan to support 2 levels of interaction. A basic level will allow a user to specify a DAS server URI, entity identifier, and required parameters directly. A service layer will allow a user to make requests at a higher level of abstraction (e.g. GetReferenceSequenceByGeneSymbol). Responses from both levels will be encapsulated within an object or object graph, probably mapping to classes within the Bio.IO.GenBank namespace, rather than the XML document returned by DAS servers. I also see the service layer as a means of further data aggregation. For example, where a DAS response includes a PubMed id, an additional service operation could obtain the details for that reference from a PubMed Web service. Your comments and suggestions regarding this project, especially its integration with the genome browser, will be appreciated.
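
For readers unfamiliar with DAS, the basic level of interaction described above amounts to building a request URL from a server, a data source, a command and a segment. The Python sketch below is not the planned client API: it assumes the usual DAS1 layout of `<server>/das/<source>/<command>?segment=<reference>:<start>,<stop>`, the function name `das_request_url` and the host `das.example.org` are hypothetical, and details should be checked against the DAS 1.6 specification. It only builds the URL and performs no network access.

```python
from urllib.parse import quote


def das_request_url(server, source, command, reference, start=None, stop=None):
    """Build a DAS1-style request URL for one segment (assumed URL layout)."""
    if (start is None) != (stop is None):
        raise ValueError("give both start and stop, or neither")
    segment = quote(str(reference), safe="")
    if start is not None:
        if not 1 <= start <= stop:
            raise ValueError("expected 1 <= start <= stop")
        segment += f":{start},{stop}"
    base = server.rstrip("/")
    return (
        f"{base}/das/{quote(source, safe='')}/{quote(command, safe='')}"
        f"?segment={segment}"
    )


# Example: features on reference "10" between positions 1000 and 2000.
url = das_request_url("http://das.example.org/", "hg19", "features", "10", 1000, 2000)
```

A service-layer call such as GetReferenceSequenceByGeneSymbol would sit on top of requests like this one and parse the XML responses into objects.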

Fred