BroadE Workshop 2013 July 9-10: Best Practices for Variant Calling with the GATK

August 13, 2013, 12:37 am

≫ Next: Slides from July 2013 workshop

≪ Previous: very different SNP calling by GATK and samtools

The full GATK team presented this workshop at the Broad Institute with support form the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

↧

Slides from July 2013 workshop

July 9, 2013, 8:54 am

≫ Next: problem with the "Call Variants" video in the GATK Best Practices webinar

≪ Previous: BroadE Workshop 2013 July 9-10: Best Practices for Variant Calling with the GATK

The slides from the July 2013 Best Practices workshop are available here:

https://www.dropbox.com/sh/jhlk451jntywfdy/vJqbKTZZd_

The videos will be put online once they are processed.

↧

problem with the "Call Variants" video in the GATK Best Practices webinar

August 19, 2013, 1:54 am

≫ Next: Adding more sample

≪ Previous: Slides from July 2013 workshop

I am not sure where I should ask this question, but the GATK forum seemed the most appropriate place. I am currently viewing the webinars from the BroadE Workshop 2013 July 9-10 and the "Call Variants" streaming doesn't seem to work:

http://www.broadinstitute.org/videos/broade-calling-variants-0

Would you have a work around to view this file? or is the file inaccessible for some reason?

↧

Adding more sample

September 5, 2013, 2:53 am

≫ Next: The new Best Practices section is online!

≪ Previous: problem with the "Call Variants" video in the GATK Best Practices webinar

We have 50 human whole genome data (12x) from the same population and we have already used GATK to call variants and followed the best practice for population level calling. We have then sequenced 20 more individual (12x) from the same population. For the population level variant calling do we have to put all the BAM files (50+20) together and then call the variants(in UnifiedGenotyper) ? Or we can somehow use the VCF file which was created from the previous 50 individual and use them as training set and call the variant for the new 20 individuals?

↧

The new Best Practices section is online!

September 13, 2013, 2:22 pm

≫ Next: Registration open for the GATK workshop (Oct 21-22)

≪ Previous: Adding more sample

We're very pleased to announce that we have finally finished our big rewrite of the Best Practices documentation. We hope that the new format, which you can find here, will prove more user-friendly, searchable and overall more helpful than the previous version.

We have a few more improvements in mind (e.g. a clickable image map of the workflow) and there may be a few bugs here and there to iron out. So please feel free to comment on this announcement and give us feedback, whether flattering or critical, so we can improve it to help you as much as possible.

↧

Registration open for the GATK workshop (Oct 21-22)

September 17, 2013, 11:18 am

≫ Next: Presentation slides for the Oct. 2013 workshop are online

≪ Previous: The new Best Practices section is online!

http://www.cvent.com/events/broade-workshop-gatk-best-practices-building-analysis-pipelines-with-queue/event-summary-de1eaa027413404ba6dc04c128d52c63.aspx

This workshop will cover the following topics:

GATK Best Practices for Variant Detection
Building Analysis Pipelines with Queue

The workshop is scheduled right before ASHG Boston, so if you're going to be in town for the conference, make sure you come a couple of days early and attend the GATK workshop!

↧

Presentation slides for the Oct. 2013 workshop are online

October 21, 2013, 7:40 pm

≫ Next: Presentation videos from Best Practices & Queue workshop (Oct 2013)

≪ Previous: Registration open for the GATK workshop (Oct 21-22)

Slides for :

Day 1 (Best Practices for Variant Calling with GATK)
Day 2 (Building Analysis Pipelines with Queue)

are available at this link in our DropBox slide archive :

https://www.dropbox.com/sh/ryz7bx4aeqc7mkl/44Q_2jvIyH

↧

Presentation videos from Best Practices & Queue workshop (Oct 2013)

December 20, 2013, 2:57 pm

≫ Next: Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism

≪ Previous: Presentation slides for the Oct. 2013 workshop are online

The presentation videos for:

Day 1 (Best Practices for Variant Calling with GATK)
Day 2 (Building Analysis Pipelines with Queue)

are available here: http://www.broadinstitute.org/gatk/guide/events?id=3391

↧

Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism

December 30, 2013, 6:11 pm

≫ Next: The GATK Best Practices for variant calling on RNAseq, in full detail

≪ Previous: Presentation videos from Best Practices & Queue workshop (Oct 2013)

Hello there! Thanks as always for the lovely tools, I continue to live in them.

Been wondering how best to interpret my VQSLOD plots/tranches and subsequent VQSLOD scores.
Attached are those plots, and a histogram of my VQSLOD scores as they are found across my replicate samples.

Methods Thus Far

We have HiSeq reads of "mutant" and wt fish, three replicates of each. The sequences were captured by size selected digest, so some have amazing coverage but not all. The mutant fish should contain de novo variants of an almost cancer-like variety (TiTv independent).

As per my interpretation of the best practices, I did an initial calling of the variants (HaplotypeCaller) and filtered them very heavily, keeping only those that could be replicated across all samples. Then I reprocessed and called variants again with that first set as a truth set. I also used the zebrafish dbSNP as "known", though I lowered the Bayesian priors of each from the suggested human ones. The rest of my pipeline follows the best practices fairly closely, GATK version was 2.7-2, and my mapping was with BWA MEM.

My semi-educated guess..

The spike in VQSLOD I see for samples found across all six replicates are simply the rediscovery of those in my truth set, and those with amazing coverage, which is probably fine/good. The part that worries me are the plots and tranches. The plots don't ever really show a section where the "known" set clusters with one set of obviously good variants but not with another. Is that OK or does that and my inflated VQSLOD values ring of poor practice?

↧

The GATK Best Practices for variant calling on RNAseq, in full detail

March 5, 2014, 11:24 pm

≫ Next: GATK workshop in Brussels, Belgium: June 24-26 (2014)

≪ Previous: Interpreting VQSLOD and Tranche Quality in a Non-Human Model Organism

We’re excited to introduce our Best Practices recommendations for calling variants on RNAseq data. These recommendations are based on our classic DNA-focused Best Practices, with some key differences in the early data processing steps, as well as in the calling step.

Best Practices workflow for RNAseq

This workflow is intended to be run per-sample; joint calling on RNAseq is not supported yet, though that is on our roadmap.

Please see the new document here for full details about how to run this workflow in practice.

In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller.

Now, before you try to run this on your data, there are a few important caveats that you need to keep in mind.

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have only been working with RNAseq for a few months, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

For one thing, these recommendations are based on high quality RNA-seq data (30 million 75bp paired-end reads produced on Illumina HiSeq). Other types of data might need slightly different processing. In addition, we have currently worked only on data from one tissue from one individual. Once we’ve had the opportunity to get more experience with different types (and larger amounts) of data, we will update these recommendations to be more comprehensive.

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback of their experiences applying our recommendations to their data. We look forward to hearing your thoughts and observations!

↧

GATK workshop in Brussels, Belgium: June 24-26 (2014)

June 11, 2014, 1:34 pm

≫ Next: A single pass mechanism to add multiple annotations to vcf

≪ Previous: The GATK Best Practices for variant calling on RNAseq, in full detail

Calling all Belgians! (and immediate neighbors)

In case you didn't hear of this through your local institutions, I'm excited to announce that we are doing a GATK workshop in Belgium in two weeks (June 24-26 to be precise). The workshop, which is open and free to the scientific community, will be held at the Royal Institute of Natural Sciences in Brussels.

This is SUPER EXCITING to me because as a small child I spent many hours drooling in front of the Institute Museum's stunningly beautiful Iguanodons, likely leaving grubby handprints all over the glass cases, to the shame and annoyance of my parents. I also happen to have attended the Lycee Emile Jacqmain which is located in the same park, right next to the Museum (also within a stone's throw of the more recently added European Parliament) so for me this is a real trip into the past. Complete with dinosaurs!

That said, I expect you may find this workshop exciting for very different reasons, such as learning how the GATK can empower your research and hearing about the latest cutting-edge developments that you can expect for version 3.2.

See this website or the attached flyer for practical details (but note that the exact daily program may be slightly different than announced due to the latest changes in GATK) and be sure to register (it's required for admission!) by emailing cvangestel at naturalsciences.be with your name and affiliation.

Please note that the hands-on sessions (to be held on the third day) are already filled to capacity. The tutorial materials will be available on our website in the days following the workshop.

↧

A single pass mechanism to add multiple annotations to vcf

June 10, 2014, 1:57 pm

≫ Next: Handling Multimappers

≪ Previous: GATK workshop in Brussels, Belgium: June 24-26 (2014)

Hi,

I was wondering if there is a nice way to apply multiple processing steps to each variant (or a group of variants) as they are read so that the variant file is not read again and again. My understanding is that even if I use Queue, each script would read the vcf again. Is that correct?

↧

Handling Multimappers

July 31, 2014, 7:05 am

≫ Next: RNA-Seq variants from 2-step alignments

≪ Previous: A single pass mechanism to add multiple annotations to vcf

Hi,

I am wondering how HaplotypeCaller in GATK 3.1 handles multimappers? I thought I read that they are passed over for variant calling but stay in the realigned, recalibrated BAM for 'completions sake', like marking duplicates not removing them but cannot find supporting info on the website or from farther afield,

Presumably it is the NotPrimaryAlignmentFilter but there is no info on that posted yet. I know I can output a BAM from HC with haplotype info in there but can I just get reads used in variant calls? Or should I trim the BAM myself to retain reads I want used? I do this for mark duplicates (removed) but for multimappers I would like to know how you define so I can do the same. The reason is for coverage estimates, using bamtobed or such means I take all realigned, recalibrated which is many more lines including multimappers which skews my results.

Thanks,

Bruce.

↧

RNA-Seq variants from 2-step alignments

August 14, 2014, 12:37 pm

≫ Next: Workshop announcement: Philadelphia, September 18-19

≪ Previous: Handling Multimappers

Hi,

Thank you for providing guidelines on RNA-Seq variant discovery. For our data, we are currently playing with multiple mapping methods and have noticed that 2-step alignments work "better" than 1-step alignments. By 2-step alignments, I mean using STAR as step1 and then take the unmapped from this and use another aligner (like Bowtie2) for alignment. If I use such a methodology, will there be an issue in variant calling when during splitting cigar strings I ask it convert the 255 MAPQ to another value (like 60 in the best practices example), since bowtie2 gives different MAPQ scores. Sorry if this seems like a stupid question, but I am just a little curious how such a thing might affect the variant calls. Any insights/comment on this will be greatly appreciated.

Thanks!

↧

Workshop announcement: Philadelphia, September 18-19

August 28, 2014, 6:44 am

≫ Next: Intorduction to the GATK Best Practices workflows

≪ Previous: RNA-Seq variants from 2-step alignments

Here is the official announcement for the upcoming workshop in Philadelphia. Registration is not necessary for the lecture sessions, but it is required for the hands-on sessions (see link further below).

We look forward to seeing you there!

The Center for Genetics and Complex Traits (CGACT) and the Institute for Biomedical Informatics (IBI) of the University of Pennsylvania Perelman School of Medicine announce a Workshop on Variants Discovery in Next Generation Sequence Data on September 18 and 19, 2014.

This workshop will focus on the core steps involved in calling variants with the Broad Institute¹s Genome Analysis Toolkit (GATK), using the "Best Practices" developed by the GATK team, and will be presented by Dr. Geraldine Van der Auwera of the Broad Institute and other instructors from the GATK team. Participants will learn why each step is essential to the calling process, what are the key operations performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of their dataset.

The workshop will take place over two consecutive days (September 18 and 19, 2014). In the morning lecture sessions, attendees will learn the rationale, theory, and real-life applications of GATK Best Practices for variant discovery in high-throughput DNA sequencing data, including recommendations for additional experimental designs and datatypes such as RNAseq. In the afternoon hands-on sessions, attendees will learn to interact with the GATK tools and apply them effectively through interactive exercises and tutorials.

The morning lecture sessions will take place on Thursday, September 18, from 9:00 am to 12:30 pm, and Friday, September 19, from 9:00 am to 11:30 am, in the Dunlop Auditorium of Stemmler Hall, University of Pennsylvania, 3450 Hamilton Walk, Philadelphia, PA 19104. Both morning sessions are open to all participants and registration is not required.

The afternoon hands-on sessions will take place on Thursday, September 18, from 2:00 pm to 5:30 pm, and Friday, September 19, from 1:00 pm to 4:30 pm. The September 18 hands-on session is aimed mainly at beginners (though familiarity with the command line environment is expected). The September 19 hands-on session is aimed at more advanced users who are already familiar with the basic GATK functions. Attendance to the hands-on sessions is limited to 20 participants each day, and precedence will be given to members of the University of Pennsylvania or its affiliated hospitals and research institutes (HUP, CHOP, Monell, Wistar, etc.).

Registration for the hands-on sessions is mandatory and open through Friday, August 29th at http://ibi.upenn.edu/?p=996.

↧

Intorduction to the GATK Best Practices workflows

April 16, 2014, 2:58 pm

≫ Next: VCF files for Indels and SNPs

≪ Previous: Workshop announcement: Philadelphia, September 18-19

This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

The "GATK Best Practices" are workflow descriptions that provide step-by-step recommendations for getting the best analysis results possible out of high-throughput sequencing data. At present, we provide the following Best Practice workflows:

These recommendations have been developed by the GATK development team over years of analysis work on many of the Broad Institute's sequencing projects, and are applied in the Broad's production pipelines. As a general rule, the command-line arguments and parameters given in the documentation examples are meant to be broadly applicable.

Important notes on context and caveats

Our testing focuses largely on data from human whole-genome or whole-exome samples sequenced with Illumina technology, so if you are working with different types of data or experimental designs, you may need to adapt certain branches of the workflow, as well as certain parameter selections and values. Unfortunately we are not able to provide official recommendations on how to deal with very different experimental designs or divergent datatypes (such as Ion Torrent).

In addition, the illustrations and tutorials provided in these pages tend to assume a simple experimental design where each sample is used to produce one DNA library that is sequenced separately on one lane of the machine. See the Guide for help dealing with other experimental designs.

Finally, please be aware that several key steps in the Best Practices workflow make use of existing resources such as known variants, which are readily available for humans (we provide several useful resource datasets for download from our FTP server). If no such resources are available for your organism, you may need to bootstrap your own or use alternative methods. We have documented useful methods to do this wherever possible, but be aware than some issues are currently still without a good solution.

Important note on GATK versions

The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

↧

VCF files for Indels and SNPs

September 30, 2014, 12:53 pm

≫ Next: Question

≪ Previous: Intorduction to the GATK Best Practices workflows

Hi,
I am following the Best Practices for DNAseq analysis and have 2 quick questions:
1. I wanted to confirm if the VCF files produced after the VariantRecalibrator and ApplyRecalibration steps for SNP and for Indels are completely independent of each other. In other words, the VCF file produced after these two steps for SNPs (mode SNP) is just for SNPs and for Indels (mode INDEL) is just for Indels.
2. What is the source of the ALT alleles in the VCF file - is it the various annotation files that were used during the analysis?
Thanks,
- Pankaj

↧

Question

November 20, 2014, 3:41 am

≫ Next: Selecting an appropriate quality score

≪ Previous: VCF files for Indels and SNPs

Where can I get the fasta and ref files for experimenting on GATK?

↧

Selecting an appropriate quality score

October 18, 2012, 4:59 am

≫ Next: reference.fa

≪ Previous: Question

Hello,

I'm sorry if I'm being dense (I'm new to all this and it is making me feel very dense indeed!), but having read the section on 'Selecting an appropriate quality score threshold' on the 'Best Practice Variant Detection' page, I am still unclear as to whether you mean I should be looking for a QUAL score of at least 30 in a deep coverage data set and should filter out any suggested SNPs that don't meet this, or a GQ score of 30 in each individual sample genotyped at the SNP in question and I only need to filter out individual samples that don't meet this threshold.

Please can you clarify?

I have pasted the bit of text I read below, just to make it clear to which bit I am referring.

Many thanks!

A common question is the confidence score threshold to use for variant detection. We recommend:

Deep (> 10x coverage per sample) data: we recommend a minimum confidence score threshold of Q30.

Shallow (< 10x coverage per sample) data: because variants have by necessity lower quality with shallower coverage we recommend a minimum confidence score of Q4 in projects with 100 samples or fewer and Q10 otherwise.

↧

reference.fa

November 25, 2014, 6:27 am

≫ Next: Missing reference allele in GVCF file after running HaplotypeCaller

≪ Previous: Selecting an appropriate quality score

Chaps,

I am a newbie to GATK. I have started the best practices pipline but stuck in bwa. From where can I get the reference.fa file? The GATK bundle has .fasta .bam .fai but no .fa. Anyone please!

↧