Hi everyone, I'm new here :)
I'm currently a Software Dev student, and this semester I'm working on a project with .NET Bio. I'm coming to this as a developer without much bioinformatics/biology background, so any suggestions or input on this project would be most welcome!
Essentially, I'll be working on an add-in to .NET Bio, and possibly also a stand-alone application, capable of processing clusters of read sequences and calculating various metrics from them. The goal is to analyse clusters (produced by a clustering algorithm such as MCL) to determine the likelihood that all the reads in a cluster come from the same genetic locus (i.e., to determine how accurate the clustering process was).
I'm currently processing SAMAlignedSequences from a BAM file and will be calculating things like the number of haplotypes and genotypes represented in each sample and cluster. Based on these counts (and potentially on other metrics), as well as the quality score of each read, the number of individuals represented in each cluster, etc., I hope to generate one or more rating scores that indicate whether each cluster is "good" or "bad". Based on this rating, "bad" clusters can be filtered out before performing downstream analysis on the data.
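To make the idea a bit more concrete, here is a rough sketch of the kind of per-cluster metrics described above. It is written in plain Python rather than against the .NET Bio API, and everything in it is an assumption for illustration: the `Read` type, the `cluster_metrics` function, the expected ploidy of 2, the quality cutoff of 20, and the simplification that every distinct read sequence counts as a separate haplotype (which ignores sequencing error).

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Read:
    """Hypothetical stand-in for one aligned read (not a .NET Bio type)."""
    individual: str      # sample/individual the read came from
    sequence: str        # read bases
    mean_quality: float  # mean Phred quality of the read


def haplotypes_per_individual(cluster: List[Read]) -> Dict[str, int]:
    """Count distinct sequences (treated here as haplotypes) per individual."""
    seen: Dict[str, Set[str]] = {}
    for read in cluster:
        seen.setdefault(read.individual, set()).add(read.sequence)
    return {ind: len(seqs) for ind, seqs in seen.items()}


def cluster_metrics(cluster: List[Read],
                    expected_ploidy: int = 2,
                    min_quality: float = 20.0) -> dict:
    """Summarise one cluster and flag it as "good" or "bad".

    Assumed rule: a cluster is "bad" if any individual shows more
    distinct haplotypes than the expected ploidy allows, or if no
    read passes the quality cutoff.
    """
    usable = [r for r in cluster if r.mean_quality >= min_quality]
    per_ind = haplotypes_per_individual(usable)
    over = sorted(ind for ind, n in per_ind.items() if n > expected_ploidy)
    return {
        "reads": len(cluster),
        "usable_reads": len(usable),
        "individuals": len(per_ind),
        "distinct_haplotypes": len({r.sequence for r in usable}),
        "over_ploidy_individuals": over,
        "is_good": bool(usable) and not over,
    }
```

A real rating would presumably weight these metrics rather than apply a single hard rule, but a function shaped like this could be run over every cluster and the clusters with `is_good` set to `False` dropped before downstream analysis.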
As part of a ddRADseq pipeline, the purpose of this project is to improve on the cluster ploidy detection step mentioned in
Adding this functionality to .NET Bio should make it more accessible for others to use, and it can perhaps be extended to be useful for a wider range of applications. In posting here, I particularly wondered whether anyone has suggestions for other uses a cluster ploidy/accuracy calculation tool could be put to, or whether there are any particular metrics you would be interested in seeing included.
I'm looking forward to fun times working on .NET Bio. :)