Channel: best-practices — GATK-Forum

BQSR - perform on whole genome or single chromosomes OK?


Hi,

I am running the GATK Best Practices again and just wanted to assure myself that I am proceeding correctly. I remember reading somewhere on here that BQSR performs best when given the whole genome to work with (similar to VQSR in that respect), i.e. that it's better to run on the whole genome rather than giving it single chromosomes. I couldn't find this comment/recommendation when I tried to look for it again, so I wanted to see what the final word is on this. How should one proceed?

Thanks,
Alva


How to speed up CombineGVCFs - seems unfeasibly slow?


Dear @Sheila & @Geraldine_VdAuwera,

I hope you had a good weekend! My question, to start this week (hopefully the only thread!), is how to use CombineGVCFs. As you may remember, I currently have ~100 WES samples for joint genotyping, and I have been testing GenotypeGVCFs directly and VQSR. However, in the near future I am going to have several hundred more samples in hand, and hence I understand that making batches using CombineGVCFs is the way forward, as this is the only way to allow for subsequent merging of new gVCFs once the sample size passes 200.

So I have been trying to run it this afternoon on a node that has 16 CPUs and 128 GB of RAM. My initial attempt was with -nt 14, but this gave the following message:

ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis CombineGVCFs currently does not support parallel execution with nt. Please run your analysis without the nt option.

So I started it without any threading, but after running for almost 1 hr it estimates it is going to take ~75 hrs.

Is this really how long it should take? I have been trying to decipher the various possible issues by looking at old threads on the forum, but I am not sure whether there is any way to speed this up (short of splitting out the chromosomes and running them all individually).

Also, I have seen mention somewhere that slowness can arise when the files have been zipped in an incompatible manner; however, the input gVCFs are exactly the same ones that were successfully passed through GenotypeGVCFs, so this seems unlikely. I saw on one old thread that CombineGVCFs was super-slow, but I don't know if this is still the case. For reference, my command was as shown below:

java -Xmx100000m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
        -T CombineGVCFs \
        -R hsapiens.hs37d5.fasta \
        -V /path/File000001.g.vcf.gz \
             ....
        -V /path/File000100.g.vcf.gz \
        -nt 14 \
         -o AllgVCFsCombined.nt14.g.vcf

Thanks, in advance, as always.

Tumor-Normal Paired Exome-Sequencing pipeline


Hello,
We're currently running a research project involving exome sequencing of tumors and paired normal blood samples from brain cancer patients.
I'm relatively new to NGS processing but thanks to this forum (and the excellent documentation provided by the Broad) I was able to put together the following pipeline for this project:

I just wanted to check with the Cancer Team that everything looks OK, as well as get any suggestions, if possible.
Looking forward to your valuable response,
-E

BroadE Workshop 2015 March 19-20: Best Practices for Variant Calling with the GATK


The full GATK team presented this workshop at the Broad Institute with support from the BroadE education program.

This workshop covered the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Away Workshop (UK) 2015 April 20-24: Best Practices for Variant Calling with GATK


Joel Thibault, Valentin Ruano-Rubio and Geraldine Van der Auwera presented this workshop in Edinburgh, Scotland, and Cambridge, England, upon invitation from the Universities of Edinburgh and Cambridge.

This workshop included two modules:

  • Best Practices for Variant Calling with the GATK

    The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

  • Beyond the Best Practices

    Additional considerations such as calling variants in RNAseq data and calling cohorts efficiently, as well as dealing with non-human data, whole-genome vs. exome data, basic quality control, and performance.

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


Away Workshop (Pretoria, SA) 2015 June 22-26: GATK Best Practices and Beyond


Laura Gauthier, Yossi Farjoun and Geraldine Van der Auwera presented this workshop in Pretoria, South Africa, upon invitation from the University of Pretoria.

This workshop covered the core steps involved in calling germline and somatic variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

Additional considerations were covered, such as calling cohorts efficiently, as well as dealing with non-human data, RNAseq data, whole-genome vs. exome, basic quality control, and performance.

This was complemented by sets of hands-on tutorials aiming to teach basic GATK usage to new users, as well as introduce pipelining concepts using Queue.

The workshop was structured into five modules:

  • Introductory materials
  • Best Practices Phase 1: Pre-processing
  • Best Practices Phase 2A: Calling germline variants
  • Best Practices Phase 2B: Calling somatic variants
  • Best Practices Phase 3: Preliminary analyses

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.


BaseRecalibrator and ploidy


Dear GATK Team,

BaseRecalibrator is in the newest Best Practices recommendations.
Does this tool support, or would you recommend, its usage for samples with ploidy higher than 2 (e.g. pooled samples)?

Thanks!
Bernt

Mapping, processing and duplicate marking with Picard tools: ValidateSamFile errors


I am trying to follow the best practices for mapping my (Paired-end Illumina HiSeq) reads to the reference, by following this presentation:

From what I understand, I should use MergeBamAlignment to clean up the output from bwa, and then use this cleaned up output for the rest of the analysis. However, when I run ValidateSamFile after running MergeBamAlignment I get a lot of errors, and running CleanSam on the file does not resolve any of them. What am I doing wrong? I've tried searching the web for more details about MergeBamAlignment but I haven't been able to find much. Please let me know if you require any additional information.

How I ran MergeBamAlignment
picard-tools MergeBamAlignment \
UNMAPPED_BAM=unmapped_reads.sam \
ALIGNED_BAM=aligned_reads.sam \
OUTPUT=aligned_reads.merged.bam \
REFERENCE_SEQUENCE=/path/to/reference.fasta \
PAIRED_RUN=true # Why is this needed?

Error report from ValidateSamFile
## HISTOGRAM java.lang.String
Error Type Count
ERROR:INVALID_CIGAR 5261
ERROR:MATES_ARE_SAME_END 30
ERROR:MISMATCH_FLAG_MATE_NEG_STRAND 30


Problem with recalibrated bam files


Hi,

I recently re-generated my realigned bam files (realigned.bam) from the GATK Best Practices, but I encountered an error when I validated the bam files using the validation tool (ValidateSamFile.jar) in Picard:

ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate alignment does not match alignment start of mate
ERROR: Record 188823, Read name HWI-ST1329:286:C2DT8ACXX:2:1303:11866:8561, Mate negative strand flag does not match read negative strand flag of mate

etc.

What exactly is going on? Where in the pipeline could this have occurred? I do not have the intermediary files so I cannot diagnose this accurately. Before I started the GATK Best Practices, my bam files had no problems; when I validated them I did not get any error messages. Does this mean that I have to regenerate the files again and start from the very beginning?

I tried to use the Picard tool FixMateInformation.jar, but then I got an error message regarding the index files. I tried deleting the bam index for one bam file and then recreating it to see if there was a problem with the indexes, but doing so did not resolve the issue. So it seems that something went wrong with the bam files at some earlier step.

Has this error been encountered before? Not sure how to proceed next, except restart everything.

Best,
Alva

BroadE Workshop 2013 Oct 21-22: GATK Best Practices & Queue


The full GATK team presented this workshop at the Broad Institute with support from the BroadE education program.

This workshop included two modules:

  • Best Practices for Variant Calling with the GATK

    The core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. View the workshop materials to learn why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

  • Building Analysis Pipelines with Queue

    An introduction to the Queue pipelining system. View the workshop materials to learn about how to use Queue to create analysis pipelines, scatter-gather them and run them locally or in parallel on a computing farm.


How to run Best Practices on parts of bam files


Hi,

So far, I have only tried running Best Practices on bam files where each bam file corresponds to a certain chromosome. How do I go about partitioning the bam files further and running Best Practices on parts of bam files? I want to make my pipeline even more efficient.

Thanks for your help,
Alva

Calling variants in RNAseq


Overview

This document describes the details of the GATK Best Practices workflow for SNP and indel calling on RNAseq data.

Please note that any command lines are only given as examples of how the tools can be run. You should always make sure you understand what is being done at each step and whether the values are appropriate for your data. To that effect, you can find more guidance here.


In brief, the key modifications made to the DNAseq Best Practices focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller. Here is a detailed overview:

[Figure: detailed overview of the RNAseq variant calling workflow]

Caveats

Please keep in mind that our DNA-focused Best Practices were developed over several years of thorough experimentation, and are continuously updated as new observations come to light and the analysis methods improve. We have been working with RNAseq for a somewhat shorter time, so there are many aspects that we still need to examine in more detail before we can be fully confident that we are doing the best possible thing.

We know that the current recommended pipeline is producing both false positive errors (wrong variant calls) and false negative errors (missed variants). While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline. A few examples of such errors are given in this article, as well as our ideas for fixing them in the future.

We will be improving these recommendations progressively as we go, and we hope that the research community will help us by providing feedback on their experiences applying our recommendations to their data.


The workflow

1. Mapping to the reference

The first major difference relative to the DNAseq Best Practices is the mapping step. For DNA-seq, we recommend BWA. For RNA-seq, we evaluated all the major software packages that specialize in RNAseq alignment, and we found that we were able to achieve the highest sensitivity to both SNPs and, importantly, indels using the STAR aligner. Specifically, we use the STAR 2-pass method, which was described in a recent publication (see page 43 of the Supplemental text of the Pär G Engström et al. paper referenced below for full protocol details -- we used the suggested protocol with the default parameters). In brief, in the STAR 2-pass approach, splice junctions detected in a first alignment run are used to guide the final alignment.

Here is a walkthrough of the STAR 2-pass alignment steps:

1) STAR uses genome index files that must be saved in unique directories. The human genome index was built from the FASTA file hg19.fa as follows:

genomeDir=/path/to/hg19
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa --runThreadN <n>

2) Alignment jobs were executed as follows:

runDir=/path/to/1pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

3) For the 2-pass STAR, a new index is then created using splice junction information contained in the file SJ.out.tab from the first pass:

genomeDir=/path/to/hg19_2pass
mkdir $genomeDir
STAR --runMode genomeGenerate --genomeDir $genomeDir --genomeFastaFiles hg19.fa \
    --sjdbFileChrStartEnd /path/to/1pass/SJ.out.tab --sjdbOverhang 75 --runThreadN <n>

4) The resulting index is then used to produce the final alignments as follows:

runDir=/path/to/2pass
mkdir $runDir
cd $runDir
STAR --genomeDir $genomeDir --readFilesIn mate1.fq mate2.fq --runThreadN <n>

2. Add read groups, sort, mark duplicates, and create index

The above step produces a SAM file, which we then put through the usual Picard processing steps: adding read group information, sorting, marking duplicates and indexing.

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample 

java -jar picard.jar MarkDuplicates I=rg_added_sorted.bam O=dedupped.bam  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

3. Split'N'Trim and reassign mapping qualities

Next, we use a new GATK tool called SplitNCigarReads, developed specially for RNAseq, which splits reads into exon segments (getting rid of Ns but maintaining grouping information) and hard-clips any sequences overhanging into the intronic regions.


In the future we plan to integrate this into the GATK engine so that it will be done automatically where appropriate, but for now it needs to be run as a separate step.

At this step we also add one important tweak: we need to reassign mapping qualities, because STAR assigns good alignments a MAPQ of 255 (which technically means “unknown” and is therefore meaningless to GATK). So we use the GATK’s ReassignOneMappingQuality read filter to reassign all good alignments to the default value of 60. This is not ideal, and we hope that in the future RNAseq mappers will emit meaningful quality scores, but in the meantime this is the best we can do. In practice we do this by adding the ReassignOneMappingQuality read filter to the splitter command.

Please note that we recently (6/11/14) edited this to fix a documentation error regarding the filter to use. See this announcement for details.

Finally, be sure to specify that reads with N cigars should be allowed. This is currently still classified as an "unsafe" option, but this classification will change to reflect the fact that this is now a supported option for RNAseq processing.

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

4. Indel Realignment (optional)

After the splitting step, we resume our regularly scheduled programming... to some extent. We have found that performing realignment around indels can help rescue a few indels that would otherwise be missed, but to be honest the effect is marginal. So while it can’t hurt to do it, we only recommend performing the realignment step if you have compute and time to spare (or if it’s important not to miss any potential indels).
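
For reference, here is a minimal sketch of the two realignment commands, run as in the DNAseq workflow; the file names and the known-indels resource (known_indels.vcf) are placeholders to adapt to your own data:

# Step 1: identify intervals that are likely to need realignment
java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R ref.fasta -I split.bam -known known_indels.vcf -o realign.intervals

# Step 2: perform the realignment over those intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R ref.fasta -I split.bam -targetIntervals realign.intervals -known known_indels.vcf -o realigned.bam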

5. Base Recalibration

We do recommend running base recalibration (BQSR). Even though the effect is also marginal when applied to good quality data, it can absolutely save your butt in cases where the qualities have systematic error modes.

Both steps 4 and 5 are run as described for DNAseq (with the same known sites resource files), without any special arguments. Finally, please note that you should NOT run ReduceReads on your RNAseq data. The ReduceReads tool will no longer be available in GATK 3.0.
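
For reference, here is a minimal sketch of the two-step recalibration; the known-sites files shown (dbsnp.vcf and known_indels.vcf) are placeholders for the same resources you would use in DNAseq:

# Step 1: build the covariation model from the data and the known variant sites
java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R ref.fasta -I realigned.bam -knownSites dbsnp.vcf -knownSites known_indels.vcf -o recal.table

# Step 2: write out a new BAM with recalibrated base quality scores
java -jar GenomeAnalysisTK.jar -T PrintReads -R ref.fasta -I realigned.bam -BQSR recal.table -o recal.bam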

6. Variant calling

Finally, we have arrived at the variant calling step! Here, we recommend using HaplotypeCaller because it is performing much better in our hands than UnifiedGenotyper (our tests show that UG was able to call less than 50% of the true positive indels that HC calls). We have added some functionality to the variant calling code which will intelligently take into account the information about intron-exon split regions that is embedded in the BAM file by SplitNCigarReads. In brief, the new code will perform “dangling head merging” operations and avoid using soft-clipped bases (this is a temporary solution) as necessary to minimize false positive and false negative calls. To invoke this new functionality, just add -dontUseSoftClippedBases to your regular HC command line. Note that the -recoverDanglingHeads argument which was previously required is no longer necessary as that behavior is now enabled by default in HaplotypeCaller. Also, we found that we get better results if we lower the minimum phred-scaled confidence threshold for calling variants on RNAseq data, so we use a default of 20 (instead of 30 in DNA-seq data).

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I input.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 -o output.vcf

7. Variant filtering

To filter the resulting callset, you will need to apply hard filters, as we do not yet have the RNAseq training/truth resources that would be needed to run variant recalibration (VQSR).

We recommend that you filter clusters of at least 3 SNPs within a window of 35 bases by adding -window 35 -cluster 3 to your command. This filter recommendation is specific to RNA-seq data.

As in DNA-seq, we recommend filtering based on Fisher Strand values (FS > 30.0) and Qual By Depth values (QD < 2.0).

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg_19.fasta -V input.vcf -window 35 -cluster 3 -filterName FS -filter "FS > 30.0" -filterName QD -filter "QD < 2.0" -o output.vcf 

Please note that we selected these hard filtering values in an attempt to optimize both high sensitivity and specificity together. By applying the hard filters, some real sites will get filtered out. This is a tradeoff that each analyst should consider based on his/her own project. If you care more about sensitivity and are willing to tolerate more false positive calls, you can choose not to filter at all (or to use less restrictive thresholds).

An example of filtered (SNPs cluster filter) and unfiltered false variant calls:

[Figure: example of false variant calls, with and without the SNP cluster filter]

An example of true variants that were filtered (false negatives). As explained in text, there is a tradeoff that comes with applying filters:

[Figure: example of true variants removed by the filters (false negatives)]


Known issues

There are a few known issues. One is that the allelic ratio is problematic: at many heterozygous sites, even when both alleles present in the DNA are visible in the RNAseq data, the ratio between the numbers of reads carrying the different alleles is far from 0.5, and thus the HaplotypeCaller (or any caller that expects a diploid genome) will miss that call. A DNA-aware mode of the caller might be able to fix such cases (which may also be candidates for downstream analysis of allele-specific expression).

Although our new tool (SplitNCigarReads) cleans up many false positive calls that are caused by splicing inaccuracies in the aligners, we still call some false variants for that same reason, as can be seen in the examples below. Some of those errors might be fixed in future versions of the pipeline with more sophisticated filters, with another realignment step in those regions, or by making the caller aware of splice positions.

[Figures: examples of false variant calls caused by splicing inaccuracies]

As stated previously, we will continue to improve the tools and process over time. We have plans to improve the splitting/clipping functionalities, increase the true positive rate and minimize the false positive rate, as well as develop statistical filtering (i.e. variant recalibration) recommendations.

We also plan to add functionality to process DNAseq and RNAseq data from the same samples simultaneously, in order to facilitate analyses of post-transcriptional processes. Future extensions to the HaplotypeCaller will provide this functionality, which will require both DNAseq and RNAseq in order to produce the best results. Finally, we are also looking at solutions for measuring differential expression of alleles.


[1] Pär G Engström et al. “Systematic evaluation of spliced alignment programs for RNA-seq data”. Nature Methods, 2013


NOTE: Questions about this document that were posted before June 2014 have been moved to this archival thread: http://gatkforums.broadinstitute.org/discussion/4709/questions-about-the-rnaseq-variant-discovery-workflow

MPG Primer on sequencing and variant discovery pipelines


Geraldine Van der Auwera presented this talk as part of the Broad Institute's Medical and Population Genetics (MPG) Primers series.

This talk provides a high-level overview of the workflow for performing variant discovery on high-throughput sequencing data, as described in the GATK Best Practices and implemented in the Broad's production pipelines.

The following points are emphasized in this presentation:

  • Informational content of data file formats and flow of information throughout the pipeline
  • Concepts involved in the data transformations (processing steps and analysis methods)
  • Motivation and key mechanics of the GVCF workflow for scalable joint variant discovery
  • Relation of the GATK Best Practices to the Broad's production pipeline implementation

The presentation slide deck is available at this link.

Best Practices for Variant Discovery in RNAseq


This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

This is our recommended workflow for calling variants in RNAseq data from single samples, in which all steps are performed per-sample. In the future we will provide cohort analysis recommendations, but these are not yet available.


The workflow is divided into three main sections that are meant to be performed sequentially:

  • Pre-processing: from raw RNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)
  • Variant discovery: from reads (BAM files) to variants (VCF files)
  • Refinement and evaluation: genotype refinement, functional annotation and callset QC

Compared to the DNAseq Best Practices, the key adaptations for calling variants in RNAseq focus on handling splice junctions correctly, which involves specific mapping and pre-processing procedures, as well as some new functionality in the HaplotypeCaller; these adaptations are highlighted in the figure below.

[Figure: RNAseq-specific adaptations highlighted on the workflow diagram]


Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make them suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Split'N'Trim; Local Realignment Around Indels (optional); and Base Quality Score Recalibration (BQSR), performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using STAR aligner (2-pass protocol) to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.

Split'N'Trim

Then, an RNAseq-specific step is applied: reads with N operators in the CIGAR strings (which denote the presence of a splice junction) are split into component reads and trimmed to remove any overhangs into splice junctions, which reduces the occurrence of artifacts. At this step, we also reassign mapping qualities from 255 (assigned by STAR) to 60 which is more meaningful for GATK tools.

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads. This step is considered optional for RNAseq.

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.


Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample) and variant filtering (also performed per-sample). The first step is designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Our current recommendation for RNAseq is to run all these steps per-sample. At the moment, we do not recommend applying the GVCF-based workflow to RNAseq data because although there is no obvious obstacle to doing so, we have not validated that configuration. Therefore, we cannot guarantee the quality of results that this would produce.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample VCFs containing raw SNP and indel calls.

Per-Sample Variant Filtering

For RNAseq, it is not appropriate to apply variant recalibration in its present form. Instead, we provide hard-filtering recommendations to filter variants based on specific annotation value thresholds. This produces a VCF of calls annotated with filtering information that can then be used in downstream analyses.


Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.


Important note on GATK versions


The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

Best recommendation for base recalibration on non-human data


Hi all!

I'm currently working on high-coverage non-human data (mammals).

After mapping via BWA, sorting and merging, my pipeline goes like this:

  1. Remove duplicates via MarkDuplicates.
  2. RealignerTargetCreator
  3. IndelRealigner using the --consensusDeterminationModel USE_SW flag
  4. Fix Mates

I currently want to begin the recalibration step before doing the actual variant calls via UnifiedGenotyper.

Since I'm working on non-human data, there is no thorough database I can trust as an input VCF file for the recalibration step.

What is your recommendation for this for non-human data?

Thank you very much!

Sagi


Is the SOR the natural logarithm of odds ratio?


Hi,

I'm trying to figure out a good threshold for the SOR for hard-filtering a SNP data set. The best practices documentation suggests using a cutoff threshold of 4, but from reading a bit more about odds ratios, it seems that taking the natural logarithm of the odds ratio may be preferable. Is the SOR reported in the VCF file actually the natural logarithm of the odds ratio? This information is not provided in the page on statistical tests or in the documentation for the SOR.

Thanks,
Jen

Calling variants on cohorts of samples using the HaplotypeCaller in GVCF mode


This document describes the new approach to joint variant discovery that is available in GATK versions 3.0 and above. For a more detailed discussion of why it's better to perform joint discovery, see this FAQ article. For more details on how this fits into the overall reads-to-variants analysis workflow, see the Best Practices workflows documentation.

Overview

This is the workflow recommended in our Best Practices for performing variant discovery analysis on cohorts of samples.


In a nutshell, we now call variants individually on each sample using the HaplotypeCaller in -ERC GVCF mode, leveraging the previously introduced reference model to produce a comprehensive record of genotype likelihoods and annotations for each site in the genome (or exome), in the form of a gVCF file (genomic VCF).


In a second step, we then perform a joint genotyping analysis of the gVCFs produced for all samples in a cohort.
This allows us to achieve the same accurate genotyping results as traditional joint calling, without the computational nightmare of exponential runtimes, and with the added flexibility of being able to re-run the population-level genotyping analysis at any time as the available cohort grows.


This is meant to replace the joint discovery workflow that we previously recommended, which involved calling variants jointly on multiple samples, with a much smarter approach that reduces computational burden and solves the "N+1 problem".



Workflow details

This is a quick overview of how to apply the workflow in practice. For more details, see the Best Practices workflows documentation.

1. Variant calling

Run the HaplotypeCaller on each sample's BAM file(s) (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs, with the option -emitRefConfidence GVCF, and using the .g.vcf extension for the output file.

Note that versions older than 3.4 require passing the options --variant_index_type LINEAR --variant_index_parameter 128000 to set the correct index strategy for the output gVCF.
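
For example, a minimal per-sample sketch (the file names are placeholders):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R ref.fasta -I sample1.bam -emitRefConfidence GVCF -o sample1.g.vcf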

2. Optional data aggregation step

If you have more than a few hundred samples, run CombineGVCFs on batches of ~200 gVCFs to hierarchically merge them into a single gVCF. This will make the next step more tractable.
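
For example, a sketch of one batch (sample file names are placeholders; a .list file containing one gVCF path per line can also be passed to -V):

java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R ref.fasta -V sample001.g.vcf -V sample002.g.vcf -V sample003.g.vcf -o batch01.g.vcf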

3. Joint genotyping

Take the outputs from step 2 (or step 1 if dealing with fewer samples) and run GenotypeGVCFs on all of them together to create the raw SNP and indel VCFs that are usually emitted by the callers.
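
For example, picking up the batch gVCFs from the previous step (file names are placeholders):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R ref.fasta -V batch01.g.vcf -V batch02.g.vcf -o raw_variants.vcf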

4. Variant recalibration

Finally, resume the classic GATK Best Practices workflow by running VQSR on these "regular" VCFs according to our usual recommendations.
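
For reference, here is a sketch of what the SNP pass looks like; the resource files, annotations and tranche threshold shown are illustrative placeholders, so check the VQSR documentation for the current recommended arguments for your data type:

# Step 1: build the recalibration model from the raw calls and the truth/training resources
java -jar GenomeAnalysisTK.jar \
    -T VariantRecalibrator \
    -R ref.fasta \
    -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 omni.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches

# Step 2: apply the model at a chosen truth sensitivity level
java -jar GenomeAnalysisTK.jar \
    -T ApplyRecalibration \
    -R ref.fasta \
    -input raw_variants.vcf \
    -mode SNP \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    --ts_filter_level 99.5 \
    -o recalibrated_snps.vcf

The indel pass is then run the same way on the output of the SNP pass, with -mode INDEL and the indel-appropriate resources.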

That's it! Fairly simple in practice, but we predict this is going to have a huge impact on how people perform variant discovery in large cohorts. We certainly hope it helps people deal with the challenges posed by ever-growing datasets.

As always, we look forward to comments and observations from the research community!

Documentation error in RNAseq workflow


We discovered today that we made an error in the documentation article that describes the RNAseq Best Practices workflow. The error is not critical, but it is likely to cause an increased rate of false positive calls in your dataset.

The error was made in the description of the "Split & Trim" pre-processing step. We originally wrote that you need to reassign mapping qualities to 60 using the ReassignMappingQuality read filter. However, this causes all MAPQs in the file to be reassigned to 60, whereas what you want to do is reassign MAPQs only for good alignments which STAR identifies with MAPQ 255. This is done with a different read filter, called ReassignOneMappingQuality. The correct command is therefore:

java -jar GenomeAnalysisTK.jar -T SplitNCigarReads -R ref.fasta -I dedupped.bam -o split.bam -rf ReassignOneMappingQuality -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

In our hands we see a bump in the rate of FP calls from 4% to 8% when the wrong filter is used. We don't see any significant number of false negatives (lost true positives) with the bad command, although we do see a few more true positives show up in its results. So basically the effect is to excessively increase sensitivity, at the expense of specificity, because poorly mapped reads are taken into account with a "good" mapping quality, where they would normally be discarded.

This effect will be stronger in datasets with lower overall quality, so your results may vary. Let us know if you observe any really dramatic effects, but we don't expect that to happen.

To be clear, we do recommend re-processing your data if you can, but if that is not an option, keep in mind how this affects the rate of false positive discovery in your data.

We apologize for this error (which has now been corrected in the documentation) and for the inconvenience it may cause you.

`MalformedReadFilter` seemingly overly aggressive?


I recently came across (blog post) a scenario in which a subset of my reads had triggered the MalformedReadFilter during indel realignment. I'm curious about the definition of "malformed", as, whilst invalid, I wouldn't have described incorrect mate names and alignment positions for unmapped reads as "malformed" and capable of causing job crashes...

Not a bug - I'm just curious to know of any implementation details or reasoning about the MalformedReadFilter.

Unparsable vcf record with allele *


Hi guys,
I have recently jointly called 27 whole-genome samples using the GenotypeGVCFs approach. While I was trying to extract some chromosomes from the final file, I got this error:
The provided VCF file is malformed at approximately line number 16076: Unparsable vcf record with allele *.

I looked into the file and found some of the multi-allelic sites have * as seen in the attached picture.

I feel the problem could be that the program realised that more than one allele is present at that position but could not ascertain which allele. I may be wrong, but please let me know what you think I can do to solve this problem.
LAWAL
