Channel: best-practices — GATK-Forum

Missing reference allele in GVCF file after running HaplotypeCaller


Hi,

I used HaplotypeCaller in GVCF mode to generate a single-sample GVCF, but when I check the VCF file I see that the reference allele is not showing up:

22  1   .   N   <NON_REF>   .   .   END=16050022    GT:DP:GQ:MIN_DP:PL  0/0:0:0:0:0,0,0
22  16050023    .   C   <NON_REF>   .   .   END=16050023    GT:DP:GQ:MIN_DP:PL  0/0:1:3:1:0,3,37
22  16050024    .   A   <NON_REF>   .   .   END=16050026    GT:DP:GQ:MIN_DP:PL  0/0:2:6:2:0,6,73
22  16050027    .   A   <NON_REF>   .   .   END=16050035    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,110
22  16050036    .   A   C,<NON_REF> 26.80   .   BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB   0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0
22  16050037    .   G   <NON_REF>   .   .   END=16050037    GT:DP:GQ:MIN_DP:PL  0/0:3:9:3:0,9,109
22  16050038    .   A   <NON_REF>   .   .   END=16050039    GT:DP:GQ:MIN_DP:PL  0/0:4:12:4:0,12,153

I am not sure where to start troubleshooting for this, since all the steps prior to using HaplotypeCaller did not generate any obvious errors.

The basic command that I used was:
java -Xmx4g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_1.bam -o raw_1.vcf -L 22 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

Have you encountered this problem before? Where should I start troubleshooting?

Thanks very much in advance,
Alva


Differences between BQSR pipeline in two different GATK documents


Hi,

I noticed that the pipeline for BQSR is slightly different between the "Workshop walkthrough (Brussels 2014)" document and the "(howto) Recalibrate base quality scores = run BQSR" document. I used the former as my guide since it is more recent, but I am curious as to why the two differ. The first step is the same in both docs, but in the second step, the output is a recal.bam file in the Brussels document instead of a post_recal.table file in the (howto) document. I am also confused by the (howto) document because its fourth step seems to be exactly the same as its second step, except that we generate a recal.bam file. Are the steps in the Brussels document more efficient, since -BQSR is used only once to generate the recalibrated bam file we need, as opposed to twice?

Thanks,
Alva

Best Practices VQSR parameters updated


The Best Practices recommendations for Variant Quality Score Recalibration have been slightly updated to use the new(ish) StrandOddsRatio (SOR) annotation, which complements FisherStrand (FS) as an indicator of strand bias (only available in GATK version 3.3-0 and above).

While we were at it, we also reconciled some inconsistencies between the tutorial and the FAQ document. As a reminder, if you ever find differences between parameters given in the VQSR docs, let us know, but FYI: the FAQ is the ultimate source of truth=true. Note also that the command line example given in the VariantRecalibrator tool doc tends to be out of date, because it can only be updated with the next release (due to a limitation of the tool doc generation system) and, well, we often forget to do it in time -- so it should never be used as a reference for Best Practice parameter values, as indicated in the caveat right underneath it which no one ever reads.

Speaking of caveats, there's no such thing as too much repetition of the fact that whole genomes and exomes have subtle differences that require some tweaks to your command lines. In the case of VQSR, that means dropping Coverage (DP) from your VQSR command lines if you're working with exomes.
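For illustration, here is a minimal sketch of what an exome SNP recalibration command might look like with DP dropped from the annotation list (file names are placeholders, and the FAQ remains the authoritative source for parameter values):

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator \
    -R reference.fa \
    -input exome_snps.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -recalFile exome_snps.recal \
    -tranchesFile exome_snps.tranches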

Finally, keep in mind that the values we recommend for tranches are really just examples; if there's one setting you should freely experiment with, that's the one. You can specify as many tranche cuts as you want to get really fine resolution.
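For example, a finer-grained set of cuts (the values here are purely illustrative) can be requested simply by adding more -tranche arguments to the VariantRecalibrator command line:

-tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 97.0 -tranche 95.0 -tranche 90.0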

Genotype calling in GATK


Excuse me:

After calling genotypes using HaplotypeCaller in GATK, I manually checked the reads covering the variant sites, and I found one exception: the genotype of one sample was 0/0 (wild type), but the pileup for this site was .,,,..,,c.,c.c,.cccc. I would expect almost all of the reads to be . or , for a 0/0 call. Is this normal?

Many thanks in advance!

CombineGVCFs - All alleles are missing


Hi,

I am combining GVCF files into single GVCF files by chromosome, using CombineGVCFs, in order to run GenotypeGVCFs. When I checked the first GVCF file generated by CombineGVCFs, I noticed that at each position, all the alleles were missing.

For example, at position 16050036, this is what comes up in the final gvcf file:
22 16050036 . A C,<NON_REF> . . BaseQRankSum=-7.360e-01;ClippingRankSum=-7.360e-01;DP=4;MQ=27.00;MQ0=0;MQRankSum=-7.360e-01;ReadPosRankSum=0.736 GT:AD:DP:MIN_DP:PL:SB ./.:1,2,0:3:.:55,0,23,58,29,86:1,0,2,0 ./.:.:1:1:0,0,0,0,0,0 ./.:.:0:0:0,0,0,0,0,0

But if we just take one of the precursor gvcf files (one individual), we clearly see the genotype at that site:
22 16050036 . A C,<NON_REF> 26.80 . BaseQRankSum=-0.736;ClippingRankSum=-0.736;DP=3;MLEAC=1,0;MLEAF=0.500,0.00;MQ=27.00;MQ0=0;MQRankSum=-0.736;ReadPosRankSum=0.736 GT:AD:DP:GQ:PL:SB 0/1:1,2,0:3:23:55,0,23,58,29,86:1,0,2,0

The command I'm using to generate these files is:

java -Xmx1g -jar GenomeAnalysisTK.jar -T CombineGVCFs -R hs37d5.fa -V vcfs.chr${numchr}.new.list -o mergeGvcf_${numchr}.vcf -L ${numchr}
where numchr is a variable previously defined (indicating the chromosome number).
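For reference, the follow-up command I intend to run on each merged file is (a sketch, using the same variables as above):

java -Xmx4g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R hs37d5.fa -V mergeGvcf_${numchr}.vcf -o final_chr${numchr}.vcf -L ${numchr}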

It seems that all the information is being taken into account except the actual genotypes. How do I solve this problem?

Thanks,
Alva

Registration open for GATK workshop (March 19-20)


Registration is now LIVE for our upcoming BroadE Workshop: Best Practices for Variant Calling with the GATK.

WHEN: Thursday, March 19 & Friday, March 20, 2015

10:00 AM - 5:00 PM (Lecture, March 19)
2:00 PM - 5:00 PM (Optional Tutorial, March 20)

WHERE: Broad Institute

Auditorium (lecture)/Yellowstone (Tutorial)
415 Main Street
Cambridge, Massachusetts 02142

Registration Schedule:

Registration closes February 27 at 5:00 PM.
Notification of acceptance or wait list status sent by March 4.

REGISTER HERE


This workshop will focus on the core steps involved in calling variants with the Broad’s Genome Analysis Toolkit, using the “Best Practices” developed by the GATK team. The GATK development team and invited guests will give talks explaining the rationale, theory and real-life applications of the Best Practices. You will learn why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset.

An optional hands-on session will be available to select participants. In this session, the GATK team will help beginners work through interactive exercises and tutorials to learn how to use GATK and apply the Best Practices to real data.

Workshop attendees will gain broad insight into the rationale of the GATK Best Practices for variant discovery, as well as a solid understanding of how individual GATK tools work and how to apply them in practice. Novices to the GATK will come out of the workshop knowing enough to identify which questions they can address using GATK tools, how to get started on designing their experiment and analytical workflow, and how to run the tools on their own computer. Existing GATK users will come out with a deeper understanding of how the GATK works under the hood and how to improve their results further, especially as regards the latest innovations.

Mini Workshop (ASHG) 2014 Oct 21: Best Practices for Variant Discovery with GATK


Ami Levy-Moonshine presented this condensed 90-minute workshop at ASHG 2014 in San Diego, CA on October 21.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials outline why each step is essential to the calling process and what key operations are performed on the data at each step. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

His slide deck is available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

GATK was also featured in another mini-workshop at ASHG which covered the iSeqTools network, focused on cloud-based analysis. The presentation slides will be posted to the iSeqTools website in the near future.

Best Practices for variant discovery in DNA:

[slide deck image]

Best Practices for variant discovery in RNAseq:

[slide deck image]

Excerpt from Ami's ASHG poster:

[poster excerpt image]

GATK VQSR tranches


Hi all - I'm stumped and need your help. I'm following the GATK Best Practices for calling variants with HaplotypeCaller in GVCF mode. One of my samples is NA12878, among 119 other samples in my cohort. For some reason GATK is missing a bunch of variants in this sample that I can clearly see in IGV but that are not listed as passing in the VCF. I discovered that the variants are being filtered out, the reason being VQSRTrancheSNP99.00to99.90. The genotype is homozygous variant, DP is 243, QUAL is 524742.54, and it's known in dbSNP. I suspect this is happening to other variants as well.

How do I adjust VQSR, or the way tranches are used and variants get placed into them? I suppose I need to fine-tune my parameters, but I would think something as obvious as this variant would pass filtering.


Best Practice pipeline for WES, including VQSR - what is best?


I am currently processing ~100 exomes and following the Best Practice recommendations for Pre-processing and Variant Discovery. However, there are a couple of gaps in the documentation, as far as I can tell, regarding exactly how to proceed with VQSR on exome data. I would be grateful for some feedback, particularly regarding VQSR. The issues are similar to those discussed in this thread: http://gatkforums.broadinstitute.org/discussion/4798/vqsr-using-capture-and-padding but my questions aren't fully addressed there (or elsewhere on the Forum as far as I can see).

Prior Steps:
1) All samples processed with same protocol (~60Mb capture kit) - coverage ~50X-100X
2) Alignment with BWA-MEM (to whole genome)
3) Remove duplicates, indel-realignment, bqsr
4) HC to produce gVCFs (-ERC)
5) Genotype gVCFs

This week I have been investigating VQSR, which has generated some questions.

Q1) Which regions should I use from my data for building the VQSR model?

Here I have tried 3 different input datasets:

a) All my variant positions (11Million positions)
b) Variant positions that are in the capture kit (~326k positions) - i.e. used bedtools intersect to only extract variants from (1)
c) Variant positions that are in the capture kit with padding of 100nt either side (~568k positions) - as above but bed has +/-100 on regions + uniq to remove duplicate variants that are now in more than one bed region

For each of the above, I have produced "sensitive" and "specific" datasets:
"Specific": --ts_filter_level 90.0 for both SNPs and INDELs
"Sensitive": --ts_filter_level 99.5 for SNPs and --ts_filter_level 99.0 for INDELs (as suggested in the definitive FAQ https://www.broadinstitute.org/gatk/guide/article?id=1259 ), applied via ApplyRecalibration as sketched below.

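Concretely, each filtered dataset was produced with an ApplyRecalibration command along these lines (a sketch with placeholder file names; only the --ts_filter_level value changes between the "specific" and "sensitive" runs):

java -jar GenomeAnalysisTK.jar -T ApplyRecalibration \
    -R reference.fa \
    -input raw_variants.vcf \
    -mode SNP \
    --ts_filter_level 99.5 \
    -recalFile recalibrate_SNP.recal \
    -tranchesFile recalibrate_SNP.tranches \
    -o recalibrated_snps.vcf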
I also wanted to see what effect, if any, the "-tranche" argument has - i.e. does it just allow for ease of filtering, or does it affect the model generated? This was not clear to me. I applied either 5 tranches or 6:

5-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 90.0 \ for both SNPs and INDELs
6-tranche: -tranche 100.0 -tranche 99.9 -tranche 99.5 -tranche 99.0 -tranche 95.0 -tranche 90.0 \ for both SNPs and INDELs

To compare the results I then used bed intersect to get back to the variants that are within the capture kit (~326k, as before). The output is shown in the spreadsheet image below.

http://i58.tinypic.com/25gc4k7.png

What the table appears to show me is that at the "sensitive" settings (orange background), the results are largely the same - the difference in "PASS" between the set at the bottom, where all variants were used, and the others is mostly accounted for by variants being pushed into the 99.9-100 tranche.

However, when trying to be specific (blue background), the difference between using all variants and just the capture region/capture+100 is marked. Also surprising (at least for me) is the huge difference in "PASS" in cells E15 and E16, where the only difference was the number of tranches given to the model (note that there is very little difference in the analogous cells in Rows 5/6 and Rows 10/11).

Q2) Can somebody explain why there is such a difference in "PASS" rows between All-SPEC and the Capture(s)-Spec
Q3) Can somebody explain why 6 tranches resulted in ~23k more PASSes than 5 tranches for the All-SPEC
Q4) What does "PASS" mean in this context - a score =100? Is it an observation of a variant position in my data that has been observed in the "truth" set? It isn't actually described in the header of the VCF, though presumably the following corresponds:
FILTER=<ID=VQSRTrancheSNP99.90to100.00+,Description="Truth sensitivity tranche level for SNP model at VQS Lod < -38682.8777">
Q5) Similarly, why do no variants fall below my lower tranche threshold of 90? Is it because they are all reliable at least to this level?

Q6) Am I just really confused? :-(

Thanks in advance for your help! :-)

Workshop presentations -- 2015 UK 4/20-24


The presentation slides from the 2015 "GATK in the UK" workshop (April 20-24) are available on DropBox here.

All slides will ultimately be posted in the Presentations section of the Guide.

Unusual calls after using HaplotypeCaller - filtered with VQSR and refinement workflow


Hi,

I have discovered some unusual calls in my VCF file after using HaplotypeCaller. I am using version 3.3 of GATK. I applied VQSR as well as the genotype refinement workflow (CalculateGenotypePosteriors, etc.) to refine the calls, but the unusual calls did not get removed. I also calculated the number of Mendelian errors in just the biallelic SNPs in my final VCF file (using PLINK) and found an unusually high percentage for each of the 3 families I am studying: 0.153%, 0.167%, and 0.25%. The percentage of triallelic SNPs is also very high: 0.111%. Why are the error rates so high?

I used the following commands to call the variants and generate the initial VCF file:

HaplotypeCaller (generate gVCF files for each individual, for each chromosome)

java -Xmx1g -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R hs37d5.fa -I recal_${ROOT}.bam -o ${outpath}raw_${ROOT}.vcf --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000

GenotypeGVCFs (generate vcf files for each chr)

java -Xmx1g -jar GenomeAnalysisTK.jar -R hs37d5.fa -T GenotypeGVCFs -V vcfs.chr${numchr}.new.list -o final_chr${numchr}.vcf -L ${numchr}

CatVariants (generate 1 vcf file with all inds and all chrs)

java -Xmx1g -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants -R hs37d5.fa -V final.new.list -out final_allHutt.vcf -assumeSorted

VQSR

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input final_allHutt.vcf -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR -an DP -mode SNP -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input final_allHutt.vcf -mode SNP --ts_filter_level 99.9 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_SNP_allHutt_2.recal -tranchesFile recalibrate_SNP_allHutt_2.tranches -o recalibrated_snps_raw_indels_allHutt_filteredout.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -resource:mills,known=false,training=true,truth=true,prior=12.0 Mills_and_1000G_gold_standard.indels.b37.vcf -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf -an QD -an DP -an FS -an SOR -an ReadPosRankSum -an MQRankSum -mode INDEL -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 --maxGaussians 4 --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches

Used excludeFiltered here

java -Xmx3g -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R hs37d5.fa -input recalibrated_snps_raw_indels_allHutt_filteredout.vcf -mode INDEL --ts_filter_level 99.0 --excludeFiltered --disable_auto_index_creation_and_locking_when_reading_rods -recalFile recalibrate_INDEL_allHutt_filteredout.recal -tranchesFile recalibrate_INDEL_allHutt_filteredout.tranches -o recalibrated_variants_allHutt_filteredout.vcf

Genotype Refinement Workflow

java -Xmx3g -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R hs37d5.fa --supporting ALL.wgs.phase3_shapeit2_mvncall_integrated_v5.20130502.sites.vcf -ped Hutt.ped -V recalibrated_variants_allHutt_filteredout.vcf -o recalibrated_variants_allHutt.postCGP.f.vcf

java -Xmx3g -jar GenomeAnalysisTK.jar -T VariantFiltration -R hs37d5.fa -V recalibrated_variants_allHutt.postCGP.f.vcf -G_filter "GQ < 20.0" -G_filterName lowGQ -o recalibrated_variants_allHutt.postCGP.Gfiltered.f.vcf

Again, the first genotype in this example (indel) passed VariantFiltration even though its coverage was zero (2/2:0,0,0:0:PASS)

The entire example is below:

1 20199272 . T TCTTC,C 3520.08 PASS AC=8,22;AF=0.160,0.440;AN=50;BaseQRankSum=-1.356e+00;ClippingRankSum=-1.267e+00;DP=487;FS=4.843;GQ_MEAN=27.84;GQ_STDDEV=40.31;InbreedingCoeff=0.1002;MLEAC=1,12;MLEAF=0.020,0.240;MQ=51.74;MQ0=0;MQRankSum=0.421;NCC=2;PG=0,0,0,0,0,0;QD=32.53;ReadPosRankSum=1.27;SOR=0.699;VQSLOD=0.687;culprit=FS GT:AD:DP:FT:GQ:PGT:PID:PL:PP 2/2:0,0,0:0:PASS:22:.:.:410,207,355,32,22,0:410,207,355,32,22,0 2/2:0,0,1:1:lowGQ:5:.:.:240,51,36,18,5,0:240,51,36,18,5,0 0/2:4,0,4:8:PASS:90:.:.:140,153,256,0,103,90:140,153,256,0,103,90 0/0:22,0,0:22:lowGQ:0:.:.:0,0,390,0,390,390:0,0,390,0,390,390 0/0:2,0,0:2:lowGQ:3:.:.:0,3,45,3,45,45:0,3,45,3,45,45 2/2:0,0,3:3:lowGQ:11:.:.:287,135,124,21,11,0:287,135,124,21,11,0 ./.:7,0,0:7:PASS 2/2:0,0,3:4:lowGQ:11:.:.:282,126,115,22,11,0:282,126,115,22,11,0 0/2:10,0,0:10:lowGQ:5:.:.:27,5,494,0,411,405:27,5,494,0,411,405 0/2:7,0,0:7:lowGQ:13:.:.:13,15,502,0,303,288:13,15,502,0,303,288 0/1:8,6,0:14:PASS:99:.:.:194,0,255,218,273,491:194,0,255,218,273,491 0/0:18,0,0:18:PASS:52:.:.:0,52,755,52,755,755:0,52,755,52,755,755 2/2:0,0,0:0:PASS:23:.:.:305,168,416,23,30,0:305,168,416,23,30,0 0/2:0,0,4:4:lowGQ:14:.:.:40,14,634,0,185,699:40,14,634,0,185,699 0/0:19,0,0:19:PASS:58:.:.:0,58,824,58,824,824:0,58,824,58,824,824 0/0:1,0,0:1:lowGQ:6:0|1:20199257_CT_C:0,6,91,6,91,91:0,6,91,6,91,91 1/1:0,0,0:0:lowGQ:2:.:.:177,11,0,12,2,44:177,11,0,12,2,44 0/1:0,0,3:3:PASS:34:.:.:94,0,388,34,38,304:94,0,388,34,38,304 0/2:15,0,2:17:lowGQ:18:0|1:20199249_CT_C:18,64,695,0,632,624:18,64,695,0,632,624 1/1:0,0,0:0:lowGQ:8:.:.:133,8,0,101,17,265:133,8,0,101,17,265 0/2:3,0,0:3:PASS:25:.:.:129,25,484,0,121,94:129,25,484,0,121,94 0/2:2,0,0:2:PASS:38:.:.:185,38,644,0,88,42:185,38,644,0,88,42 0/2:2,0,0:2:lowGQ:14:.:.:256,14,293,0,57,41:256,14,293,0,57,41 ./.:11,0,0:11:PASS 1/2:0,0,1:1:lowGQ:14:.:.:115,24,14,36,0,359:115,24,14,36,0,359 1/2:0,0,1:1:PASS:28:.:.:188,39,28,35,0,206:188,39,28,35,0,206 2/2:0,0,3:3:lowGQ:8:1|1:20199249_CT_C:88,88,89,8,9,0:88,88,89,8,9,0

Why are some genotypes being passed when there is no support for their genotype? Why are the Mendelian error rates so high?

Thank you very much in advance,
Alva

Best practice for variant calling on Ion Torrent data with GATK


Hi,

I wonder how well GATK works with Ion Torrent data. Is there any recommended practice to handle Ion Torrent data, especially SNP and indel calling, with GATK?

Thanks,
XZ

Realigning sample-merged bam files?


Following GATK's best practices, I have individually realigned/recalibrated my sample-lane bams and merged them by sample:

sample1_lane1.dedup.realn.recal.bam -->
sample1_lane2.dedup.realn.recal.bam --> sample1.merged.bam
sample2_lane1.dedup.realn.recal.bam -->
sample2_lane2.dedup.realn.recal.bam --> sample2.merged.bam
...

I am ready to dedup and realign my sample-merged BAMs; however, I am uncertain about the best approach. Is the consensus to convert back to FASTQ via Picard (MarkDuplicates, SortSam, and SamToFastq) and then run bwa mem? Or is it more expedient/accurate to realign the sample-merged BAM using bwa aln followed by bwa sampe?

Best Practices for Variant Discovery in DNAseq


This article is part of the Best Practices documentation. See http://www.broadinstitute.org/gatk/guide/best-practices for the full documentation set.

This is our recommended workflow for calling variants in DNAseq data from cohorts of samples, in which steps from data processing up to variant calling are performed per-sample, and subsequent steps are performed jointly on all the individuals in the cohort.

[Best Practices workflow diagram]

The workflow is divided into three main sections that are meant to be performed sequentially:

  • Pre-processing: from raw DNAseq sequence reads (FASTQ files) to analysis-ready reads (BAM files)
  • Variant discovery: from reads (BAM files) to variants (VCF files)
  • Refinement and evaluation: genotype refinement, functional annotation and callset QC

Pre-Processing

The data generated by the sequencers are put through some pre-processing steps to make them suitable for variant calling analysis. The steps involved are: Mapping and Marking Duplicates; Local Realignment Around Indels; and Base Quality Score Recalibration (BQSR), performed in that order.

Mapping and Marking Duplicates

The sequence reads are first mapped to the reference using BWA mem to produce a file in SAM/BAM format sorted by coordinate. The next step is to mark duplicates. The rationale here is that during the sequencing process, the same DNA molecules can be sequenced several times. The resulting duplicate reads are not informative and should not be counted as additional evidence for or against a putative variant. The duplicate marking process identifies these reads as such so that the GATK tools know they should ignore them.
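A minimal sketch of this stage might look as follows (the read group, thread count and file names are placeholders, and Picard is assumed for sorting and duplicate marking):

bwa mem -t 4 -R '@RG\tID:lane1\tSM:sample1\tPL:ILLUMINA' reference.fa reads_1.fq reads_2.fq > aligned.sam
java -jar picard.jar SortSam INPUT=aligned.sam OUTPUT=sorted.bam SORT_ORDER=coordinate
java -jar picard.jar MarkDuplicates INPUT=sorted.bam OUTPUT=dedup.bam METRICS_FILE=dedup_metrics.txt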

Realignment Around Indels

Next, local realignment is performed around indels, because the algorithms that are used in the initial mapping step tend to produce various types of artifacts. For example, reads that align on the edges of indels often get mapped with mismatching bases that might look like evidence for SNPs, but are actually mapping artifacts. The realignment process identifies the most consistent placement of the reads relative to the indel in order to clean up these artifacts. It occurs in two steps: first the program identifies intervals that need to be realigned, then in the second step it determines the optimal consensus sequence and performs the actual realignment of reads.
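The two steps correspond to two GATK tools, RealignerTargetCreator and IndelRealigner; a minimal sketch, assuming placeholder file names and the b37 Mills resource as the known indels set:

java -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R reference.fa -I dedup.bam \
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
    -o realign_targets.intervals
java -jar GenomeAnalysisTK.jar -T IndelRealigner -R reference.fa -I dedup.bam \
    -targetIntervals realign_targets.intervals \
    -known Mills_and_1000G_gold_standard.indels.b37.vcf \
    -o realigned.bam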

Base Quality Score Recalibration

Finally, base quality scores are recalibrated, because the variant calling algorithms rely heavily on the quality scores assigned to the individual base calls in each sequence read. These scores are per-base estimates of error emitted by the sequencing machines. Unfortunately the scores produced by the machines are subject to various sources of systematic error, leading to over- or under-estimated base quality scores in the data. Base quality score recalibration is a process in which we apply machine learning to model these errors empirically and adjust the quality scores accordingly. This yields more accurate base qualities, which in turn improves the accuracy of the variant calls. The base recalibration process involves two key steps: first the program builds a model of covariation based on the data and a set of known variants, then it adjusts the base quality scores in the data based on the model.
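Here too, the two steps map to two tools, BaseRecalibrator and PrintReads; a minimal sketch with placeholder file names and bundle resources as the known sites:

java -jar GenomeAnalysisTK.jar -T BaseRecalibrator -R reference.fa -I realigned.bam \
    -knownSites dbsnp_138.b37.vcf \
    -knownSites Mills_and_1000G_gold_standard.indels.b37.vcf \
    -o recal_data.table
java -jar GenomeAnalysisTK.jar -T PrintReads -R reference.fa -I realigned.bam \
    -BQSR recal_data.table \
    -o recal.bam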


Variant Discovery

Once the data has been pre-processed as described above, it is put through the variant discovery process, i.e. the identification of sites where the data displays variation relative to the reference genome, and calculation of genotypes for each sample at that site. Because some of the variation observed is caused by mapping and sequencing artifacts, the greatest challenge here is to balance the need for sensitivity (to minimize false negatives, i.e. failing to identify real variants) vs. specificity (to minimize false positives, i.e. failing to reject artifacts). It is very difficult to reconcile these objectives in a single step, so instead the variant discovery process is decomposed into separate steps: variant calling (performed per-sample), joint genotyping (performed per-cohort) and variant filtering (also performed per-cohort). The first two steps are designed to maximize sensitivity, while the filtering step aims to deliver a level of specificity that can be customized for each project.

Per-Sample Variant Calling

We perform variant calling by running the HaplotypeCaller on each sample BAM file (if a sample's data is spread over more than one BAM, then pass them all in together) to create single-sample gVCFs. If there are more than a few hundred samples, we combine the gVCFs in batches of ~200 gVCFs using a specialized tool, CombineGVCFs. This will make the next step more tractable and reflects that the processing bottleneck lies with the number of input files and not the number of samples in those files.
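A minimal per-sample sketch (file names are placeholders; the index arguments are needed for gVCF outputs in GATK 3.3 and earlier):

java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I sample1.recal.bam \
    --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 \
    -o sample1.g.vcf

and, for large cohorts, batches of gVCFs can be merged like this:

java -jar GenomeAnalysisTK.jar -T CombineGVCFs -R reference.fa \
    -V sample1.g.vcf -V sample2.g.vcf \
    -o batch1.g.vcf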

Joint Genotyping

All available samples are then jointly genotyped by taking the gVCFs produced earlier and running GenotypeGVCFs on all of them together to create a set of raw SNP and indel calls. This cohort-wide analysis empowers sensitive detection of variants even at difficult sites.
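A minimal sketch (placeholder file names; each single-sample gVCF or combined batch is passed with -V):

java -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R reference.fa \
    -V batch1.g.vcf -V batch2.g.vcf \
    -o raw_variants.vcf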

Variant Quality Score Recalibration

Variant recalibration involves using a machine learning method to assign a well-calibrated probability to each variant call in a raw call set. We can then use this variant quality score in the second step to filter the raw call set, thus producing a subset of calls with our desired level of quality, fine-tuned to balance specificity and sensitivity.
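The two steps correspond to VariantRecalibrator and ApplyRecalibration; a minimal SNP-mode sketch with placeholder file names (see the VQSR FAQ for the authoritative resource and annotation arguments):

java -jar GenomeAnalysisTK.jar -T VariantRecalibrator -R reference.fa -input raw_variants.vcf \
    -resource:hapmap,known=false,training=true,truth=true,prior=15.0 hapmap_3.3.b37.vcf \
    -resource:omni,known=false,training=true,truth=true,prior=12.0 1000G_omni2.5.b37.vcf \
    -resource:1000G,known=false,training=true,truth=false,prior=10.0 1000G_phase1.snps.high_confidence.b37.vcf \
    -resource:dbsnp,known=true,training=false,truth=false,prior=2.0 dbsnp_138.b37.vcf \
    -an QD -an MQ -an MQRankSum -an ReadPosRankSum -an FS -an SOR \
    -mode SNP \
    -tranche 100.0 -tranche 99.9 -tranche 99.0 -tranche 90.0 \
    -recalFile snps.recal -tranchesFile snps.tranches
java -jar GenomeAnalysisTK.jar -T ApplyRecalibration -R reference.fa -input raw_variants.vcf \
    -mode SNP --ts_filter_level 99.5 \
    -recalFile snps.recal -tranchesFile snps.tranches \
    -o recalibrated_snps.vcf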


Refinement and evaluation

In this last section, we perform some refinement steps on the genotype calls (GQ estimation and transmission phasing), add functional annotations if desired, and do some quality evaluation by comparing the callset to known resources. None of these steps are absolutely required, and the workflow may need to be adapted quite a bit to each project's requirements.
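As a sketch of the genotype refinement step (placeholder file names; the supporting population callset and pedigree are optional inputs):

java -jar GenomeAnalysisTK.jar -T CalculateGenotypePosteriors -R reference.fa \
    -V recalibrated_variants.vcf \
    --supporting 1000G_phase3_sites.vcf -ped family.ped \
    -o postCGP.vcf
java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fa -V postCGP.vcf \
    -G_filter "GQ < 20.0" -G_filterName lowGQ \
    -o postCGP.Gfiltered.vcf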


Important note on GATK versions


The Best Practices have been updated for GATK version 3. If you are running an older version, you should seriously consider upgrading. For more details about what has changed in each version, please see the Version History section. If you cannot upgrade your version of GATK for any reason, please look up the corresponding version of the GuideBook PDF (also in the Version History section) to ensure that you are using the appropriate recommendations for your version.

When to use BQSR?


Hi,

I am new to NGS/GATK and have a paired-end targeted (targeted to one region on one autosome) Illumina sequencing project on roughly 470 samples. I was wondering whether indel realignment and BQSR are appropriate for my setting, and what considerations should be taken into account when judging whether these steps are appropriate.

I'm curious because I noticed this text here:
Small targeted experiments
The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets.

I was unsure whether my project qualified as a 'small dataset' where BQSR should not be run.

Thanks for any help you may be able to offer!


BQSR on single samples or merged BAMs


Hi,
I have ~150 WGS all sequenced in batches on Illumina HiSeq over a number of years, on a number of machines/flowcells.

I want to perform BQSR on my BAM files before calling variants. However for my organism I do not have access to a resource like dbSNP for true variants. So I am following the protocol by doing a first round of variant calling, subsetting to the SNPs with highest confidence and using these as the known variants for BQSR.

My question is: should I carry this out on samples individually, i.e. one BAM per sample, on which I use HaplotypeCaller for initial variant calling, then subset to the best SNPs, use these for BaseRecalibrator, and apply the recalibration to the original sample before carrying on with single-sample variant calling and then joint genotyping as per Best Practices....

OR

As I have multiple samples from the same flowcells and lanes, would I gain more information by combining those samples into a multisample BAM first, before carrying out BQSR? I'm a little unsure of how the covariates used by BQSR are actually calculated, and whether I can increase the accuracy of my recalibration in this way. Each sample has between 500 million and 5 billion nucleotides sequenced.

Many thanks.

GATK usage for Targeted Exome sequencing data analysis


We are using a licensed version of GATK here.
The version is GATK 2014.3-3.2.2-7-gf9cba99.

While using the tool for analysis of exome data I had few questions.

1: Are there different parameters used in GATK at different stages for whole-exome versus targeted exome sequencing data?

2: What guidelines should be followed for targeted exome-seq analysis if I use BQSR?

I was reading the page ("https://www.broadinstitute.org/gatk/guide/tagged?tag=exome") and it was mentioned as " The same guidelines as for whole exome analysis apply except you do not run BQSR on small datasets."

Could you kindly clear up these doubts?

Away Workshop (Philadelphia) 2014 Sep 18-19: Best Practices for Variant Calling with GATK


Eric Banks, Sheila Chandran and Geraldine Van der Auwera presented this workshop in Philadelphia, PA, upon invitation from the School of Medicine at UPenn.

This workshop covered all the core steps involved in calling variants with the GATK, using the “Best Practices” developed by the GATK team. The presentation materials describe why each step is essential to the calling process, what key operations are performed on the data at each step, and how to use the GATK tools to get the most accurate and reliable results out of your dataset. This includes specific information about variant calling in RNAseq data and efficient analysis of cohorts.

The material was presented over two days, organized in the following modules:

  • Data pre-processing: From FASTQ to analysis-ready BAM
  • Variant Discovery: From BAM to analysis-ready VCF

This was complemented by a set of hands-on exercises aiming to teach basic GATK usage to new users.

The workshop materials are available at this link if you're viewing this post in the forum, or below if you are viewing the presentation page already.

[workshop slides image]

GenotypeGVCFs


I was trying to combine the sets of VCF files for all my samples so that I have one single VCF output, using the command below:
java -d64 -Xmx48g -jar ${GATK}/GenomeAnalysisTK.jar \
-R ${REF} \
-T GenotypeGVCFs \
--variant A.g.vcf \
--variant B.g.vcf \
--variant C.g.vcf \
-stand_emit_conf 30 \
-stand_call_conf 30 \
-o genotype.vcf

but I got this error message:
"The following invalid GT allele index was encountered in the file: END=21994810". I have tried to locate where the problem could be coming from, but I do not understand this. Could you please advise me?

Cancer variant calling workflow with mutect and changes/additions to the GATK best practices


Hi all,

I'm working on a cancer variant calling workflow (DNAseq) with MuTect as the variant caller rather than HaplotypeCaller (due to the latter's intolerance of low allele fractions).
I've gone through the forum and couldn't find a full answer to my inquiry. My question actually has two parts:

  1. What changes/additions should I implement in my pipeline/workflow, relative to the GATK Best Practices workflow for variant analysis, to make it more suitable for cancer variant calling with MuTect, taking into account that I don't have a normal tissue sample, just tumor, and an hg19 reference from the bundle:
    https://www.broadinstitute.org/gatk/guide/best-practices?bpm=DNAseq#data-processing-ovw

  2. Is there a difference between the recommended "--known" variants/sites of the GATK's best practices workflow for general variant calling and the one for cancer variant calling with MuTect as the variant caller? Specifically to this article:
    http://gatkforums.broadinstitute.org/discussion/1247/what-should-i-use-as-known-variants-sites-for-running-tool-x

Thank you very much in advance,
Alon
