
Please Enter Our Bioinformatics Challenge!



Challenge Guidelines
Date: September 10, 2012

I) PRIZE: $2,500

II) CHALLENGE PERIOD:

A) Phase I:
i) BEGINS: 9:00 AM (0900) PST, Wednesday, September 12, 2012
ii) ENDS: 5:00 PM (1700) PST, Monday, November 27, 2012
B) Phase II:
i) BEGINS: 9:00 AM (0900) PST, Tuesday, November 28, 2012
ii) ENDS: 5:00 PM (1700) PST, Monday, December 7, 2012

III) PURPOSE: To incentivize global submission of comments for the refinement of the Competition Validation Protocol, which is the mechanism by which we will declare a winner(s) of the $10 million Archon Genomics X PRIZE presented by Express Scripts (Competition).

IV) GUIDELINES FOR ONLINE SUBMISSIONS:

A) CHALLENGE: Submit comments that provide insight into potential false positives and negatives. By combining comparison information between the unified call set and external resources, we can identify 17,654 fosmid variants (10%) not found in both the Complete Genomics and Illumina datasets. Additionally, Illumina and Complete Genomics combine to call 1,228 variants (0.7%) that are not in the fosmid call set. Perform additional in-depth analysis to classify either uniquely identified fosmid calls or potential false positives, or false negative calls in the individual technologies.

B) CHALLENGE: Improve variant representation and assessment. The variation software framework works hard to make variant representations as uniform as possible. Indels are especially challenging and we welcome practical examples of regions that need additional standardization.

C) CHALLENGE: Refine approaches to unifying variant calls: What we learn from the additional inspection of discordant variants can help inform improved approaches to filtering. This is a great opportunity to develop generalized, reusable methods for combining variants from multiple approaches.

D) CHALLENGE: Test and propose specific refinements to the software, web platform, or report functionality of the software.

V) INSTRUCTIONS:

A) Each Contributor shall register on the Competition Scientific Community according to the Rules stated below
B) Each Contributor must visit the Competition Public Phase wiki to obtain the technical details required for participation by visiting (http://bit.ly/NGVbIs)
C) After completing their analysis and before the conclusion of Phase I, each Contributor shall provide written comments directly on the Competition Scientific Community (this site) under the Caption “Please Enter Our Bioinformatics Challenge” to be considered a valid written submission. Registered users may provide written submissions on multiple (up to four) Challenges.
D) Each Contributor may also submit clarifying questions directly in the appropriate section of the Competition Scientific Community
E) For specific technical questions, please contact Justin Johnson at EdgeBio via email at JJohnson@edgebio.com. Please include “BIOINFORMATICS CHALLENGE QUESTION” in the subject line
F) For questions relating to the Rules and Guidelines of the Bioinformatics Challenge, please contact Grant Campany at grant.campany@xprize.org and include “BIOINFORMATICS CHALLENGE QUESTION” in subject line

VI) RULES:

A) Only one cash prize of $2,500 USD (Prize) will be awarded at the sole discretion of the X PRIZE Foundation based upon written submissions via "Reply to This Topic" section below (this site)
B) To qualify, each person is required to provide his/her given/legal name as a user name and may not have more than one user name registered on the Competition Scientific Community site (Contributor)
C) Each Contributor acknowledges that only submitted comments received via the "Reply to This Topic" section below during Phase I of the Challenge Period will be considered for the Prize
D) Each Contributor may submit multiple comments during Phase I of the Challenge Period
E) Phase I will be considered the “comment submission” phase of the Challenge Period. During Phase I each Contributor shall submit written comments on the online Scientific Community (this site). The X PRIZE Foundation shall identify a list of finalists to enter into Phase II. The X PRIZE Foundation shall select the finalists of the Bioinformatics Challenge based on material contribution per section IV above.
F) Phase II shall be considered the “voting and selection” phase of the Bioinformatics Challenge. The X PRIZE Foundation shall present submissions by the finalists to the Scientific Community (this site) for final voting. The submission receiving the most unique positive votes shall determine the winner of the Prize.
G) All Contributors hereby acknowledge and agree to:
i) The Guidelines of the Bioinformatics Challenge stated herein,
ii) The declaration of the winner is at the sole discretion of the X PRIZE Foundation, and
iii) The winner of the Bioinformatics Challenge shall agree to participate in commercially reasonable public relations efforts deemed appropriate between the X PRIZE Foundation and the winner
iv) The RULES are subject to change, so please check this site for most up-to-date version of Guidelines as indicated by the Date above.
  • Brendan O'Fallon
    One idea to improve the quality of variant calling that I've been experimenting with lately is a pure-SVM (support vector machine) approach. The basic idea is to eschew all attempts at modelling the bases at a site as a binomial or Poisson process. Instead, collect lots of data about each potentially variant site, and train a model using a set of trusted variant and trusted invariant sites. Then use the model to call variants on test data sets.
    I think this approach has merit for a few reasons. First, instead of a sequential process of calling potential variants and then deciding which ones are real based on statistics like strand bias, read depth, variant frequency, read position, etc., the SVM can bring all of these statistics (and many more) together to produce an estimate of the probability in a single coherent model. SVMs are good at handling lots of data and very high-dimensional systems, so it's easy to simply throw in every statistic that one suspects could influence the probability of being real and let the machine learn which ones are important from the training data. In the test cases I've run so far this approach is about twice as fast as the GATK's UnifiedGenotyper (and likely much faster than the HaplotypeCaller, from what I've heard). The downside is that the model must be trained, but this isn't too hard. True variants often aren't too hard to recognize: variants at 10% or greater frequency in 1000 Genomes are pretty unlikely to be false, for instance.
    I'm happy to share my code if you're interested. Currently only a SNP caller has been implemented, but it works. The code's on GitHub at http://github.com/brendanofallon/SNPSVM. It also requires downloading libsvm (open source, no dependencies, works great) to do the SVM stuff.
    Cheers,
    Brendan
    brendan.d.ofallon@aruplab.com
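
To make the train-then-call idea above concrete, here is a minimal sketch of a linear SVM fitted by batch subgradient descent on the hinge loss. It is not the SNPSVM code and does not use libsvm; the two features, their scaling, and the toy training points are hypothetical stand-ins for per-site statistics.

```python
def train_linear_svm(samples, labels, lam=0.01, lr=0.1, epochs=200):
    """Fit w, b by batch subgradient descent on L2-regularised hinge loss.

    samples: list of equal-length feature lists; labels: +1 (variant) or -1.
    """
    n_features = len(samples[0])
    n = len(samples)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the L2 penalty
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1.0:  # inside the margin, so the hinge term is active
                for j in range(n_features):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (call a variant) or -1 (no call) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Hypothetical, already-scaled features per site:
# [variant read fraction, strand balance]
trusted_variants = [[2.0, 2.0], [2.5, 1.5], [3.0, 2.5]]
trusted_invariants = [[-2.0, -2.0], [-2.5, -1.5], [-3.0, -2.5]]
X = trusted_variants + trusted_invariants
y = [1] * 3 + [-1] * 3
w, b = train_linear_svm(X, y)
```

A real caller would use many more features and a library such as libsvm; the point here is only that every statistic enters one model rather than a chain of separate filters.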

  • Brad Chapman
    Brendan;
    This is brilliant, thanks for sharing. I'm 100% agreed on the usefulness of machine learning classifiers, and we currently have an SVM-based approach in the consolidation pipeline, post variant-calling, that uses Weka to do the classification:

    https://github.com/chapmanb/bcbio.var...

    We train 4 separate classifiers based on properties of the variants: SNPs versus indels and repetitive versus non-repetitive regions, and are actively working on identifying useful metrics that help discriminate.

    It's exciting to have someone else using similar tactics and I'd love to talk more and collaborate on this. As you've mentioned, the trickiest problem right now is getting good quality training sets of true and false positives. What approaches are you taking for identifying false positive inputs? Which metrics have you found most useful for discrimination? Do you have any comparisons with your callsets versus GATK or other callers?

    Thanks again for this, looking forward to talking more,
    Brad

  • Brendan O'Fallon
    The best training data I've made so far has come from an exome sample that was sequenced twice independently. Initial variants were called with the UnifiedGenotyper with a very low quality threshold (-stand_emit_conf 1.0). To identify true positives, I look for variants that were called in both runs, have high quality (q50 or greater) and that have been seen in both 1000 Genomes and the ESP5400 data set. For false positives, I look for variants that are unique to one run or the other, that are low quality (less than q20), and that have never been seen in 1000 Genomes or ESP5400. The vast majority of these false positives are of very low depth, so I do some filtering to increase the number of higher-depth false positives. I get around 12000 true positives and around 4000 false positives this way. I also throw in a few thousand "invariant" sites as false training data so the trainer can recognize sites at which no variation exists as invariant.
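
The selection rules in the paragraph above can be sketched as a small labelling function. This is an illustration only, not the author's code: the `Call` record and its field names are hypothetical, and the depth-based filtering of false positives is left out.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate variant; all field names are hypothetical."""
    qual: float        # phred-scaled call quality
    in_run1: bool      # called in the first sequencing run
    in_run2: bool      # called in the second sequencing run
    in_1000g: bool     # previously seen in 1000 Genomes
    in_esp5400: bool   # previously seen in ESP5400


def training_label(call):
    """Return 'TP', 'FP', or None (ambiguous, left out of the training set)."""
    in_both_runs = call.in_run1 and call.in_run2
    in_one_run_only = call.in_run1 != call.in_run2
    if in_both_runs and call.qual >= 50 and call.in_1000g and call.in_esp5400:
        return "TP"
    if in_one_run_only and call.qual < 20 and not (call.in_1000g or call.in_esp5400):
        return "FP"
    return None
```

Calls that satisfy neither rule return `None`, which keeps ambiguous sites out of both training classes.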
    Currently I'm working with a list of about 15 statistics, although most of these emit more than one number per site. Here are a few of the ones I believe are most informative:
    - Relative probability that the number of variant reads was sampled from a binomial distribution with p = 0.5 or p=0.99 (pretty much just like the UnifiedGenotyper)
    - Mean read position of variant bases at a site
    - Variance in read position of variant and invariant bases
    - Mean mapping quality of reads with a variant and reads without a variant
    - Mean base quality of adjacent sites for both reference and variant bases
    - Longest homopolymer run on reference in both directions
    - Longest dinucleotide run in both directions
    - Simple measure of sequence complexity in region
    - Mean number of mismatches on reads with variant base and reference bases
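
Three of the statistics listed above are simple enough to sketch directly. The functions below are one possible reading of the descriptions, not the SNPSVM implementation; in particular, measuring the homopolymer run immediately to the left and right of the site is an assumption about what "in both directions" means.

```python
import math
from statistics import mean, pvariance


def binom_logpmf(k, n, p):
    """Natural-log binomial probability of k variant reads out of n."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)


def het_vs_hom_log_ratio(variant_reads, depth):
    """log P(data | p=0.5) - log P(data | p=0.99); positive favours p=0.5."""
    return (binom_logpmf(variant_reads, depth, 0.5)
            - binom_logpmf(variant_reads, depth, 0.99))


def flanking_homopolymer_runs(reference, site):
    """Lengths of the single-base runs immediately left and right of `site`."""
    def run_length(start, step):
        if not 0 <= start < len(reference):
            return 0
        base, i, length = reference[start], start, 0
        while 0 <= i < len(reference) and reference[i] == base:
            length += 1
            i += step
        return length
    return run_length(site - 1, -1), run_length(site + 1, 1)


def read_position_stats(positions):
    """Mean and population variance of the variant base's offset within reads."""
    return mean(positions), pvariance(positions)
```

For example, `flanking_homopolymer_runs("ACGTTTTAGGGC", 7)` reports a run of four Ts to the left of the site and three Gs to the right.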

    I'm experimenting with more of these. The nice thing about the SVM approach is that there's no reason not to throw in tons of data. It figures out what's important.
    I do have a bunch of comparisons with the GATK, but so far just looking at things like Ti / Tv ratios. Here's a recent set of results from a NA12878 exome:

                       GATK (q30)   SNPSVM (q10)
    Total SNPs         42536        41821
    Ti / Tv            2.45         2.51
    Unique SNPs        1084         369
    Ti / Tv Unique     0.89         1.17

    So SNPSVM is a bit more specific, calling 715 fewer SNPs in total (about 1.7%), but the Ti / Tv is somewhat better, and the unique calls probably have a greater fraction of true variants than the unique calls from the GATK. Results are also favorable when compared to the sites called in NA12878 from HapMap v3: SNPSVM has higher sensitivity and specificity when looking at those sites. We're doing a bunch of Sanger sequencing in the near future to determine what's going on with the discordant sites.
    Ideally, things will continue to improve as we create better training sets, add in more statistics, and fine-tune parameters for the SVM training.
    Any ideas for additional statistics to include?
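
For readers reproducing the Ti / Tv figures above, here is a minimal sketch of the calculation from a list of (ref, alt) SNPs. Each base has one possible transition and two possible transversions, so purely random substitutions tend toward a ratio near 0.5; that is why a higher ratio among the unique calls suggests a larger share of real variants. The input format is an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps):
    """snps: iterable of (ref, alt) single-base pairs with ref != alt."""
    transitions = transversions = 0
    for ref, alt in snps:
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    return transitions / transversions if transversions else float("inf")
```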

  • Brad Chapman
    Brendan;
    Thanks for the great summary. We're using a lot of overlapping metrics, with the main difference that we've been scaffolding off of GATK's VariantAnnotator approach. In addition to GATK's standard metrics we have some custom annotations that tackle similar areas:

    - Justin Zook at NIST wrote annotators for position of variants within reads: https://github.com/chapmanb/bcbio.var...
    - We calculate neighboring base quality with an annotator written in Clojure: https://github.com/chapmanb/bcbio.var...
    - For repeats issues, I've been looking at entropy of the region (https://github.com/chapmanb/bcbio.var...), problematic secondary structure (https://github.com/chapmanb/bcbio.var...) and genome mappability scores (http://sourceforge.net/apps/mediawiki...)
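
As an illustration of the region-entropy idea in the list above, here is a minimal sketch that computes Shannon entropy of the base composition in a window around a site. It is not the linked bcbio implementation; the window size and function names are assumptions.

```python
import math
from collections import Counter


def sequence_entropy(sequence):
    """Shannon entropy in bits of the base composition (0 = a single base)."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropy(reference, site, flank=10):
    """Entropy of the reference bases within `flank` positions of `site`."""
    start = max(0, site - flank)
    return sequence_entropy(reference[start:site + flank + 1])
```

Composition entropy flags homopolymers well but scores an ordered repeat such as ACACAC the same as any other two-base mix, so a k-mer based variant may be a better fit for dinucleotide repeats.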

    Repeats and indels are the two areas where I'm currently finding the most difficult issues. Do you have a sense of how many of your differing variants are in repetitive regions?

    More generally, it would be useful to compile a list of questionable sites that GATK and SNPSVM differ on and determine if any of the metrics help identify these regions. I'm very interested in being able to quantify difficult-to-assess regions this way, which would help identify potential blind spots in the genome where callers are likely to differ.
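
Compiling the list of sites where two callers differ can be sketched with plain set operations. The `(chrom, pos, ref, alt)` key is an assumption, and it only works if both call sets use the same variant representation.

```python
def discordant_sites(calls_a, calls_b):
    """Split two call sets into shared, A-only, and B-only variant keys.

    Each call is a (chrom, pos, ref, alt) tuple.
    """
    a, b = set(calls_a), set(calls_b)
    return a & b, a - b, b - a


# Hypothetical example calls from two callers
caller_a = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T")]
caller_b = [("chr1", 100, "A", "G"), ("chr2", 75, "G", "A")]
shared, only_a, only_b = discordant_sites(caller_a, caller_b)
```

The `only_a` and `only_b` sets can then be joined back to per-site metrics to see which ones separate the two groups.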

    On the SVM side, have you experimented with different kernels or other tuning parameters?

    Thanks again,
    Brad

  • Brendan O'Fallon
    Quickly looking over sets of discordant variants from the GATK and SNPSVM, a couple of things jump out. GATK calls a lot of pretty strand-biased sites that are not called by SNPSVM (not surprising, since I don't think the UG takes strand bias into account at all). SNPSVM makes more calls at sites with spanning deletions, which the GATK appears hesitant to do. Not sure if most of those are false or not. SNPSVM also seems a bit bold about calling SNPs in homopolymer runs, which I'm guessing are mostly false.

    Haven't tried too many different kernels, just a linear and an RBF kernel. In my reading about libsvm the RBF kernel is presented as a logical choice. libsvm also includes a script that performs a simple grid search over two tuning parameters to determine the combination with the best cross-validation accuracy; I've run this a couple of times to find the best tuning parameters (although I don't re-run it for every new statistic I collect).
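
For reference, the RBF kernel and the shape of a two-parameter grid search can be sketched as follows. This is not libsvm's script: `cv_accuracy` is a hypothetical caller-supplied function that trains with the given C and gamma and returns cross-validation accuracy, and the grid ranges are up to the caller.

```python
import math
from itertools import product


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * squared_distance)


def grid_search(cv_accuracy, log2_c_values, log2_gamma_values):
    """Return (best_accuracy, C, gamma) over a grid of powers of two.

    cv_accuracy(C, gamma) is supplied by the caller (hypothetical here).
    """
    best = None
    for log2_c, log2_gamma in product(log2_c_values, log2_gamma_values):
        c, gamma = 2.0 ** log2_c, 2.0 ** log2_gamma
        accuracy = cv_accuracy(c, gamma)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, gamma)
    return best
```

Searching in powers of two keeps the grid small while still covering several orders of magnitude for each parameter.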
    I really like the idea of using 'outside' metrics, such as the genome mappability score you mentioned, as additional information. That's the beauty of the SVM approach, it's so easy to toss in additional parameters and see what happens.
    One difficulty I'm currently having is identifying ways to visualize or summarize the results of a training run. For instance, it would be nice to know which statistics are the most helpful, or somehow project the points into 2D space so clusters could be more easily identified by eye. Have you run across any tools to do this? libsvm doesn't provide much, and I'm pretty inexperienced in the machine learning world.
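
One simple way to get a first look at which statistics are most helpful is to score each one separately by how far apart the two training classes sit relative to their spread (a Fisher-style score). This is a suggestion rather than anything libsvm provides, the feature names below are hypothetical, and it ignores interactions between statistics that an RBF SVM can exploit.

```python
from statistics import mean, pvariance


def fisher_scores(true_sites, false_sites, names):
    """Rank features by (mean difference)^2 / (sum of class variances)."""
    scores = {}
    for j, name in enumerate(names):
        pos = [row[j] for row in true_sites]
        neg = [row[j] for row in false_sites]
        spread = pvariance(pos) + pvariance(neg)
        gap = (mean(pos) - mean(neg)) ** 2
        if spread == 0:
            scores[name] = float("inf") if gap else 0.0
        else:
            scores[name] = gap / spread
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical training rows: [mean mapping quality, strand balance]
true_rows = [[30.0, 0.5], [32.0, 0.4], [28.0, 0.6]]
false_rows = [[10.0, 0.5], [12.0, 0.6], [8.0, 0.4]]
ranking = fisher_scores(true_rows, false_rows, ["mapq", "strand"])
```

In this toy data the mapping-quality column separates the classes and the strand column does not, so `mapq` ranks first.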

  • Brad Chapman
    Brendan;
    Thanks for this, it's fun to be discussing with others working on the same problems. I'll look into libsvm more; the parameter search would be useful to help optimize tuning.

    For your GATK comparisons, it might be worth running through their full best practice pipeline including VQSR. Raw UnifiedGenotyper output is known to have false positives that can be identified and filtered via the variant recalibration step. This should help resolve some of the strand bias issues you noticed and limit the discrepancies.

    For visualization, HSPH and EdgeBio are developing a framework for these types of problems, allowing linked exploration of multiple metrics simultaneously. It's still in early development, but we have a demo server here:

    http://variantviz.rc.fas.harvard.edu/

    And a couple of videos of it in action here:

    http://bcbio.wordpress.com/2012/09/17...

    Thanks again for all the discussion,
    Brad

  • Chalisa Srichannon
    I have a very simple thought about how to use the VCF file.
    One idea is that SVM is O(n^3), and the ideal case must be O(n), so I want to simplify it. I think I am going for a linked list, reducing my own Primewalk on the DNA strain (or spliced strain) to a simple term of a single equation and turning this into a unique sum. Knowing this sum, you could actually detect the chromosome and the SNPs on each strain.

    Reducing this to the unique sum mentioned above, storing the database as a linked list, and using binary search to find the best-matching DNA strain would define the SNP algorithm from Primewalk. Solving the equation might take time, but it is also linear, so it might not be bad at all.

    Using a semantic web is also a good idea, but I think it is very hard to train for pattern recognition and would make things worse and slower. (This is not my answer.)

    I am about to copyright Primewalk.
    If anyone wants to discuss this, please email me at chalisas824@live.co.uk.
    BR,
    Chalisa Srichannon

  • Brad Chapman
    Chalisa;
    Thanks for the ideas and link to your work. Have you applied Primewalk to variant calling data? If so, we'd be happy to look at your results and see if there is any fit for the work we're doing. We have several public datasets available for X Prize that you are welcome to experiment with. Thanks again,
    Brad

  • Chalisa Srichannon
    I have a different set of data for Primewalk. I use the following files: NT_005612.16, NT_022459.15, NT_022517.18, NT_029928.13. These are for chromosome 3.

    Could you please point me to the correct datasets that will be used for the X Prize, for splice sites and especially the VCF file?

  • Chalisa Srichannon
    Anyone who is interested in the Primewalk idea, please contact me.

    Chalisas824@live.co.uk

    BR,
    C.S.

  • Brad Chapman
    Chalisa;
    Absolutely, all the data is available from GenomeSpace for more analysis. The wiki has detailed information about the callsets and links to the downloads:

    https://edgebio.atlassian.net/wiki/di...

    Brad

  • Chalisa Srichannon
    Hi,

    My idea is to find a single formula to represent the whole dataset.

    1. For the prime walk, sum(Xk*Wk) is the same as the FFT radix-N representation, from which we can recover the backward representation of X[n] from a single number, so we could do this type of sorting in a much better O(n log n).
    Note that the highest likelihood of Y, the single-number representation of a combination of Xs, is 42%. Therefore, I need (2).
    2. I use the linked list to build the depth and nodes of the datasets as it walks along the length of the gene.
    3. By subtracting the radix-N representation, we could have a backward solution for SNPs.
    4. Another approach is to use polynomials and find sets of Gröbner bases. This could also be combined with the graph coloring idea, which is quite complicated to carry out for all the datasets.

    I am not really familiar with the datasets, as this is also a new topic in my studies, and I am writing the program and documentation. I will post some more useful approaches on GitHub and share them; please feel free to comment on this approach.

    It would be easier if I started all over. If anyone has a book on the open-source side of this toolkit, please reply.

    C.S.

  • PHASE I is now CLOSED. Thank you for providing comments!

    Grant