
(How to) Mark duplicates with MarkDuplicates or MarkDuplicatesWithMateCigar



This tutorial updates Tutorial#2799.

Here we discuss two tools, MarkDuplicates and MarkDuplicatesWithMateCigar, that flag duplicates. We provide example data and example commands for you to follow along with the tutorial (section 1), including tips on estimating library complexity for PCR-free samples and patterned flow cell technologies. In section 2, we point out special memory considerations for these tools. In section 3, we highlight the similarities and differences between the two tools. Finally, section 4 gets into details that may be of interest to some readers, including comments on the metrics file.

To mark duplicates in RNA-Seq data, use MarkDuplicates; the reasons are explained in sections 2 and 3. And if you are considering MarkDuplicatesWithMateCigar for your DNA data, be sure insert lengths are short and that you have a low percentage of split or multi-mapping records.

Obviously, expect more duplicates for samples prepared with PCR than for PCR-free preparations. Duplicates arise from various sources, including within the sequencing run. As such, even PCR-free data can give rise to duplicates, albeit at low rates, as illustrated here with our example data.

We welcome feedback. Share your suggestions in the Comment section at the bottom of this page.


Jump to a section

  1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar
  2. Slow or out of memory error? Special memory considerations for duplicate marking tools
  3. Conceptual overview of duplicate flagging
  4. Details of interest to some

Tools involved

  • MarkDuplicates
  • MarkDuplicatesWithMateCigar

Prerequisites

  • Installed Picard tools
  • Coordinate-sorted and indexed BAM alignment data. Secondary/supplementary alignments are flagged appropriately (256 and 2048 flags) and additionally with the mate unmapped (8) flag. See the MergeBamAlignment section (3C) of Tutorial#6483 for a description of how MergeBamAlignment ensures such flagging.
  • For MarkDuplicatesWithMateCigar, pre-computed Mate CIGAR (MC) tags. Data produced according to Tutorial#6483 will have the MC tags added by MergeBamAlignment. Alternatively, see the tools RevertOriginalBaseQualitiesAndAddMateCigar and FixMateInformation; a sketch using the latter follows this list.
  • Appropriately assigned Read Group (RG) information. Read Group library (RGLB) information factors into molecular duplicate detection. Optical duplicates are limited to those from the same RGID.
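If your aligned data lack MC tags, Picard's FixMateInformation can add them. Below is a minimal sketch, assuming a Picard version whose FixMateInformation supports the ADD_MATE_CIGAR option; the file names are placeholders.

    # placeholder file names; check your Picard version's documentation for this option
    java -jar picard.jar FixMateInformation \
        INPUT=input.bam \
        OUTPUT=input_withMC.bam \
        ADD_MATE_CIGAR=true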

Download example data

  • Use the advanced tutorial bundle's human_g1k_v37_decoy.fasta as reference. This same reference is available to load in IGV.
  • tutorial_6747.tar.gz data contain human paired 2x150 whole genome sequence reads originally aligning at ~30x depth of coverage. The sample is a PCR-free preparation of the NA12878 individual run on the HiSeq X platform. This machine type, along with HiSeq 4000, has the newer patterned flow cell that differs from the typical non-patterned flow cell. I took the reads aligning to a 1 Mbp genomic interval (10:96,000,000-97,000,000), sanitized them, and realigned the reads (BWA-MEM -M) to the entire genome according to the workflow presented in Tutorial#6483 to produce snippet.bam. The data have (i) no supplementary records; (ii) secondary records flagged with the 256 flag and the mate-unmapped (8) flag; and (iii) unmapped records (4 flag) with mapped mates (mates have 8 flag), zero MAPQ (column 5) and asterisks for CIGAR (column 6). The notation allows read pairs where one mate maps and the other does not to sort and remain together when we apply genomic intervals, such as in the generation of the snippet. For a quick summary of these flag categories, see the flagstat check below.
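Assuming samtools is installed, flagstat summarizes counts of secondary, duplicate, paired, and mate-unmapped records in the snippet:

    samtools flagstat 6747_snippet.bam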

Related resources


1. Commands for MarkDuplicates and MarkDuplicatesWithMateCigar

The following commands take a coordinate-sorted and indexed BAM and return (i) a BAM with the same records in coordinate order and with duplicates marked by the 1024 flag, (ii) a duplication metrics file, and (iii) an optional matching BAI index.

For a given file with all MC (mate CIGAR) tags accounted for:

  • and where all mates are accounted for, each tool--MarkDuplicates and MarkDuplicatesWithMateCigar--examines the same duplicate sets but prioritizes differently which inserts get marked as duplicates. This situation is represented by our snippet example data.
  • but containing records with missing mates, MarkDuplicates ignores those records, while MarkDuplicatesWithMateCigar still considers them for duplicate marking, using the MC tag for mate information. Again, the duplicate scoring methods differ between the tools.

Use the following commands to flag duplicates for 6747_snippet.bam. These commands produce qualitatively different data.

Score duplicate sets based on the sum of base qualities using MarkDuplicates:

# INPUT may be specified multiple times to merge files
# OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from the default of 100
# CREATE_INDEX is optional
java -Xmx32G -jar picard.jar MarkDuplicates \
INPUT=6747_snippet.bam \
OUTPUT=6747_snippet_markduplicates.bam \
METRICS_FILE=6747_snippet_markduplicates_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
CREATE_INDEX=true \
TMP_DIR=/tmp

Score duplicate sets based on total mapped reference length using MarkDuplicatesWithMateCigar:

# INPUT may be specified multiple times to merge files
# OPTICAL_DUPLICATE_PIXEL_DISTANCE is changed from the default of 100
# CREATE_INDEX is optional
java -Xmx32G -jar picard.jar MarkDuplicatesWithMateCigar \
INPUT=6747_snippet.bam \
OUTPUT=6747_snippet_markduplicateswithmatecigar.bam \
METRICS_FILE=6747_snippet_markduplicateswithmatecigar_metrics.txt \
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
CREATE_INDEX=true \
TMP_DIR=/tmp

Comments on select parameters

  • Each tool has a distinct default DUPLICATE_SCORING_STRATEGY. For MarkDuplicatesWithMateCigar it is TOTAL_MAPPED_REFERENCE_LENGTH and this is the only scoring strategy available. For MarkDuplicates you can switch the DUPLICATE_SCORING_STRATEGY between the default SUM_OF_BASE_QUALITIES and TOTAL_MAPPED_REFERENCE_LENGTH. Both scoring strategies use information pertaining to both mates in a pair, but in the case of MarkDuplicatesWithMateCigar the information for the mate comes from the read's MC tag and not from the actual mate.
  • To merge multiple files into a single output, e.g. when aggregating a sample from across lanes, specify the INPUT parameter for each file; a sketch follows this list. The tools merge the read records from the multiple files into the single output file. The tools mark duplicates for the entire library (RGLB) and account for optical duplicates per RGID. INPUT files must be coordinate sorted and indexed.
  • The Broad's production workflow increases OPTICAL_DUPLICATE_PIXEL_DISTANCE to 2500 to better estimate library complexity. The default setting for this parameter is 100. Changing this parameter does not alter duplicate marking; it only changes the count of optical duplicates and the library complexity estimate in the metrics file, in that whatever is counted as an optical duplicate does not factor into library complexity. The increase reflects the fact that our example data were sequenced on the patterned flow cell of a HiSeq X machine. Both HiSeq X and HiSeq 4000 technologies decrease pixel unit area by 10-fold, so the equivalent pixel distance in non-patterned flow cells is 250. You may ask why we still count optical duplicates for patterned flow cells, which by design should have no optical duplicates. We are hijacking this feature of the tools to account for other types of duplicates arising from the sequencer. Sequencer duplicates are not limited to optical duplicates and should be differentiated from PCR duplicates for more accurate library complexity estimates.
  • By default the tools flag duplicates and retain them in the output file. To remove the duplicate records from the resulting file, set the REMOVE_DUPLICATES parameter to true. However, given that you can set GATK tools to include duplicates in analyses by adding -drf DuplicateRead to commands, the more storage-efficient option is to retain the marked file in place of the original input file.
  • To optionally create a .bai index, add and set the CREATE_INDEX parameter to true.
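To illustrate the merging and scoring options above, here is a hedged sketch that merges two per-lane files (the lane file names are hypothetical) and switches MarkDuplicates to the TOTAL_MAPPED_REFERENCE_LENGTH scoring strategy:

    java -Xmx32G -jar picard.jar MarkDuplicates \
        INPUT=6747_snippet_lane1.bam \
        INPUT=6747_snippet_lane2.bam \
        OUTPUT=6747_snippet_merged_markduplicates.bam \
        METRICS_FILE=6747_snippet_merged_markduplicates_metrics.txt \
        DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH \
        OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 \
        CREATE_INDEX=true \
        TMP_DIR=/tmp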

For snippet, the duplication metrics are identical whether marked by MarkDuplicates or MarkDuplicatesWithMateCigar. We have 13.4008% duplication, with 255 unpaired read duplicates and 18,254 paired read duplicates. However, as the screenshot at the top of this page illustrates, and as section 4 explains, the data qualitatively differ.

back to top


2. Slow or out of memory error? Special memory considerations for duplicate marking tools

The seemingly simple task of marking duplicates is one of the most memory-hungry processes, especially for paired-end reads. Both tools are compute-intensive and require more memory than most other processes.

MarkDuplicatesWithMateCigar is a single-pass tool: it streams the duplicate marking routine in a manner that allows for piping. The trade-off is that, for a given file, its memory requirements can be greater than those of MarkDuplicates. Because of these memory constraints, we recommend MarkDuplicates for alignments that have large reference skips, e.g. spliced RNA alignments.

For large files, (1) use the Java -Xmx setting and (2) set the TMP_DIR parameter to a temporary directory. These options allow the tool to run without slowing down or hitting an out-of-memory error. For the purposes of this tutorial, commands are given as if the example data were a large file, which we know it is not.

    java -Xmx32G -jar picard.jar MarkDuplicates \
    ... \
    TMP_DIR=/tmp 

These options can be omitted for small files such as the example data, and the equivalent command is as follows.

    java -jar picard.jar MarkDuplicates ...   

Set the Java max heap size, specified by the -Xmx#G option, to the maximum your system allows.

The high memory cost, especially for MarkDuplicatesWithMateCigar, arises in part because the tool systematically traverses genomic coordinate intervals for the inserts in question, and for every read it marks as a duplicate it must keep track of the mate, which may or may not map nearby, so that reads are marked as pairs with each record emitted in its coordinate turn. In the meantime, this information is held in memory--the first choice for faster processing--until the memory limit is reached, at which point memory spills to disk. We set this limit high to minimize instances of memory spilling to disk.

In the example command, the -Xmx32G Java option caps the maximum heap size, or memory usage, to 32 gigabytes, which is the limit on the server I use. This is in contrast to the 8G setting I use for other processes on the same sample data--a 75G BAM file. To find a system's default maximum heap size, type java -XX:+PrintFlagsFinal -version, and look for MaxHeapSize.
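For example, the following filters the JVM flags output down to the relevant setting (the grep is just a convenience):

    java -XX:+PrintFlagsFinal -version | grep MaxHeapSize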

Set an additional temporary directory with the TMP_DIR parameter for memory spillage.

When the tool hits the memory limit, memory spills to disk, and the resulting disk I/O slows the process down. The disk location is what you specify with the TMP_DIR parameter. If you work on a server separate from the storage where you read and write files, setting TMP_DIR to the server's local temporary directory (typically /tmp) can reduce processing time compared to setting it to the storage disk, because the tool then avoids traversing the network file system when spilling memory. Be sure the TMP_DIR location you specify provides enough storage space; use df -h to see how much is available.

back to top


3. Conceptual overview of duplicate flagging

The aim of duplicate marking is to flag all but one of a duplicate set as duplicates and to use duplicate metrics to estimate library complexity. Duplicates have a higher probability of being non-independent measurements from the exact same template DNA. Duplicate inserts are marked by the 0x400 bit (1024 flag) in the second column of a SAM record, for each mate of a pair. This allows downstream GATK tools to exclude duplicates from analyses (most do this by default). Certain duplicates, i.e. PCR and sequencer duplicates, violate assumptions of variant calling and also potentially amplify errors. Removing these, even at the cost of removing serendipitous biological duplicates, allows us to be conservative in calculating the confidence of variants.

GATK tools allow you to disable the duplicate read filter with -drf DuplicateRead so you can include duplicates in analyses.
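As a hedged sketch, the first command below counts records carrying the 1024 duplicate flag with samtools, and the second shows -drf DuplicateRead on a GATK 3 command line; HaplotypeCaller is just an illustrative tool choice here, and the output file name is a placeholder.

    # count records flagged as duplicate (0x400 = 1024)
    samtools view -c -f 1024 6747_snippet_markduplicates.bam

    # disable the duplicate read filter to include duplicates in an analysis
    java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
        -R human_g1k_v37_decoy.fasta \
        -I 6747_snippet_markduplicates.bam \
        -drf DuplicateRead \
        -o variants_including_duplicates.vcf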

For a whole genome DNA sample, duplicates arise from three sources: (i) from DNA shearing, when distinct molecular templates happen to produce identical insert mappings; (ii) from PCR amplification of a template (PCR duplicates); and (iii) from sequencing, e.g. optical duplicates. The tools cannot distinguish between these types of duplicates, with the exception of optical duplicates. In estimating library complexity, the latter two types of duplicates are undesirable and should each factor in differently.

When should we not care about duplicates? Given duplication metrics, we can make some judgement calls on the quality of our sample preparation and sequencer run. Of course, we may not expect a complex library if our samples are targeted amplicons. Also, we may expect minimal duplicates if our samples are PCR-free. Or it may be that because of the variation inherent in expression level data, e.g. RNA-Seq, duplicate marking becomes ritualistic. Unless you are certain of your edge case (amplicon sequencing, RNA-Seq allele-specific expression analysis, etc.) where duplicate marking adds minimal value, you should go ahead and mark duplicates. You may find yourself staring at an IGV session trying to visually calculate the strength of the evidence for a variant. We can pat ourselves on the back for having the forethought to systematically mark duplicates and turn on the IGV duplicate filter.

The Broad's Genomics Platform uses MarkDuplicates twice for multiplexed samples. Duplicates are flagged first per sample per lane to estimate library complexity, and second to aggregate data per sample while marking all sample duplicates. In the second pass, duplicate marking tools again assess all reads for duplicates and overwrite any prior flags.
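A minimal sketch of this two-pass approach, with hypothetical per-lane file names:

    # Pass 1: per sample per lane, to estimate library complexity
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_lane1.bam \
        OUTPUT=sampleA_lane1_markduplicates.bam \
        METRICS_FILE=sampleA_lane1_metrics.txt

    # Pass 2: aggregate lanes per sample; prior duplicate flags are reassessed and overwritten
    java -jar picard.jar MarkDuplicates \
        INPUT=sampleA_lane1_markduplicates.bam \
        INPUT=sampleA_lane2_markduplicates.bam \
        OUTPUT=sampleA_aggregated_markduplicates.bam \
        METRICS_FILE=sampleA_aggregated_metrics.txt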

Our two duplicate flagging tools share common features but differ at the core. As the name implies, MarkDuplicatesWithMateCigar uses the MC (mate CIGAR) tag for mate alignment information. Unlike MarkDuplicates, it is a single-pass tool that requires pre-computed MC tags.

  • For RNA-Seq data mapped against the genome, use MarkDuplicates. Specifically, MarkDuplicatesWithMateCigar will refuse to process data with large reference skips, which are frequent in spliced RNA alignments, where the gaps are denoted with an N in the CIGAR string.
  • Both tools only consider primary mappings, even if mapped to different contigs, and ignore secondary/supplementary alignments (256 flag and 2048 flag) altogether. Because of this, before flagging duplicates, be sure to mark primary alignments according to a strategy most suited to your experimental aims. See MergeBamAlignment's PRIMARY_ALIGNMENT_STRATEGY parameter for strategies the tool considers for changing primary markings made by an aligner.
  • Both tools identify duplicate sets identically with the exception that MarkDuplicatesWithMateCigar additionally considers reads with missing mates. Missing mates occur for example when aligned reads are filtered using an interval list of genomic regions. This creates divorced reads whose mates aligned outside the targeted intervals.
  • Both tools identify duplicates as sets of read pairs that have the same unclipped alignment start and unclipped alignment end. The tools intelligently factor for discordant pair orientations given these start and end coordinates. Within a duplicate set, with the exception of optical duplicates, read pairs may have either pair orientation--F1R2 or F2R1. For optical duplicates, pairs in the set must have the same orientation. Why this is so is explained in section 4.
  • Both tools take into account clipped and gapped alignments and singly mapping reads (mate unmapped and not secondary/supplementary).
  • Each tool flags duplicates according to different priorities. MarkDuplicatesWithMateCigar prioritizes which pair to leave as the representative nondup based on the total mapped length of a pair while MarkDuplicates can prioritize based on the sum of base qualities of a pair (default) or the total mapped length of a pair. Duplicate inserts are marked at both ends.

back to top


4. Details of interest to some

To reach a high target coverage depth, some fraction of sequenced reads will, by chance, be duplicate reads.

Let us hope the truth of a variant never comes down to so few reads that duplicates should matter so. Keep in mind the better evidence for a variant is the presence of overlapping reads that contain the variant. Also, take estimated library complexity at face value--an estimate.

Don't be duped by identical numbers. Data from the two tools qualitatively differ.

First, let me reiterate that secondary and supplementary alignment records are skipped and never flagged as duplicate.

Given a file with no missing mates, each tool identifies the same duplicate sets from primary alignments only and therefore the same number of duplicates. To reiterate, the number of identical loci or duplicate sets and the records within each set are the same for each tool. However, each tool differs in how it decides which insert(s) within a set get flagged and thus which insert remains the representative nondup. Also, if there are ties, the tools may break them differently in that tie-breaking can depend on the sort order of the records in memory.

  • MarkDuplicates by default prioritizes the sum of base qualities for both mates of a pair. The pair with the highest sum of base qualities remains as the nondup.
  • As a consequence of using the mate's CIGAR string (provided by the MC tag), MarkDuplicatesWithMateCigar can only prioritize the total mapped reference length, as provided by the CIGAR string, in scoring duplicates in a set. The pair with the longest mapping length remains as the nondup.
  • If there are ties after applying each scoring strategy, both tools break the ties down a chain of deterministic factors starting with read name.

Duplicate metrics in brief

We can break down the metrics file into two parts: (1) a table of metrics that counts various categories of duplicates and gives the library complexity estimate, and (2) histogram values in two columns.

See DuplicationMetrics for descriptions of each metric. For paired reads, duplicates are considered for the insert. For single end reads, duplicates are considered singly for the read, increasing the likelihood of being identified as a duplicate. Given the lack of insert-level information for these singly mapping reads, the insert metrics calculations exclude these.

The library complexity estimate only considers the duplicates that remain after subtracting out optical duplicates. For the math to derive estimated library size, see formula (1.2) in Mathematical Notes on SAMtools Algorithms.
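As we read the cited notes, with N the number of read pairs examined (optical duplicates subtracted) and C the number of distinct read pairs observed, the estimated library size X is the numerical solution of

    \frac{C}{X} = 1 - e^{-N/X}

which has no closed form, so the tools solve for X iteratively.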

The histogram values extrapolate the calculated library complexity to a saturation curve plotting the gains in complexity if you sequence additional aliquots of the same library. The first bin's value represents the current complexity.
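To pull just the histogram section out of the metrics file, e.g. for plotting the saturation curve, here is a small sketch that relies on the '## HISTOGRAM' marker standard in Picard metrics files:

    sed -n '/## HISTOGRAM/,/^$/p' 6747_snippet_markduplicates_metrics.txt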

Pair orientation F1R2 is distinct from F2R1 for optical duplicates

Here we refer you to a five minute video illustrating what happens at the molecular level in a typical sequencing by synthesis run.

What I would like to highlight is that each strand of an insert has a chance to seed a different cluster. I will also point out, due to sequencing chemistry, F1 and R1 reads typically have better base qualities than F2 and R2 reads.

Optical duplicate designation requires the same pair orientation.

Let us work out the implications of this for a paired end, unstranded DNA library. During sequencing, within the flow cell, for a particular insert produced by sample preparation, the strands of the insert are separated and each strand has a chance to seed a different cluster. Let's say for InsertAB, ClusterA and ClusterB, and for InsertCD, ClusterC and ClusterD. InsertAB and InsertCD are identical in sequence and length and map to the same loci. It is possible InsertAB and InsertCD are PCR duplicates, and also possible they represent original inserts. Each strand is then sequenced in the forward and reverse directions to give four pieces of information in total for the given insert, e.g. ReadPairA and ReadPairB for InsertAB. The pair orientations of these two pairs are reversed--one cluster will give F1R2 and the other will give F2R1 pair orientation. Both read pairs map exactly to the same loci. Our duplicate marking tools consider ReadPairA and ReadPairB in the same duplicate set for regular duplicates but not for optical duplicates. Optical duplicates require identical pair orientation.

back to top



How to identify read group information


We began GATK with "Identify read group information"; however, I'm still wondering about the format "@RG\tID:group1\tSM:sample1\tPL:illumina\tLB:lib1\tPU:unit1". Should "LB:lib1" and "PU:unit1" be different when the sample changes?

Presentation slides for GATK workshops in Australia, Feb 1-4


This week we are teaching a couple of GATK Best Practices workshops in Sydney (one Opera House, check) and Melbourne, Australia. Joining me are Methods Team developers Mark and Megan.

The presentation slides from Day 1 as well as the materials for the Day 2 hands-on tutorial are available at this Google Drive link.

Learned so far: the fierceness of the Australian sun is no laughing matter

GATK best practices for exome/targeted capture (small region)


Hi,
I was wondering if there is a set of best practices developed specifically for calling variants from a small-region target capture sample. I looked but could not find one-stop comprehensive documentation. If not, are there specific recommendations, such as certain parameter values for exome/targetome analysis (e.g., -L)?

Thanks!

Is new MQ jittering functionality recommended?


The release notes for 3.5 state "Added new MQ jittering functionality to improve how VQSR handles MQ". My understanding is that in order to use this, we will need to use the --MQCapForLogitJitterTransform argument in VariantRecalibrator. I have a few questions on this:
1) Is the above correct, i.e. the new MQ jittering functionality is only used if --MQCapForLogitJitterTransform is set to something other than the default value of zero?
2) Is the use of MQCapForLogitJitterTransform recommended?
3) If we do use MQCapForLogitJitterTransform, the tool documentation states "We recommend to either use --read-filter ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller or use a MQCap=max+10 to take that into account". Is one of these to be preferred over the other? Given that it seems that sites that have been realigned can have values up to 70, but sites that have not can have values no higher than 60, it seems to me that the ReassignOriginalMQAfterIndelRealignment with HaplotypeCaller option might be preferred, but I would like to check before running.

RNA-seq variant calling and merging of sample replicates


Hello, and thanks for making all the GATK tools! I have recently started to try my hand at variant calling of my RNA-seq data, following the GATK Best Practices more or less verbatim, only excluding indel realignment (because I am only interested in SNPs at this point) and BQSR (partly because I have very high quality data, but mostly because I couldn't get it to work in the workflow).

I have three replicates for each of my samples, and my question is where, if at all, I should merge the data from them. I am not sure if I can (or even should!) merge the FASTQ files before the alignment step, or merge the aligned BAM files, or something else. I read that for aligners such as BWA the options are (more or less) equivalent, but the RNA-seq Best Practices workflow uses STAR... What would be the "correct" way to do it, if at all? How would merging (at some level) affect the speed of the workflow, and can I optimise that somehow?

If it's a bad idea to do merging, how would I determine the "true" variant from my three resulting VCF files at the end, for cases where they differ?

Call variants from RNA-seq for RNA editing detection


Dear GATK staff

I would like to use the GATK tools for the detection of possible RNA editing events. I followed the RNA-seq Best Practices up to the variant calling step itself. There I hesitate to use HaplotypeCaller because I would not assume that the editing sites follow any kind of allelic ratio. Therefore I wanted to ask if it might be better to use MuTect2 at this stage. I would call it like ...

java -jar GenomeAnalysisTK.jar -T MuTect2 -R reference.fasta -I:tumor normal1.bam -dontUseSoftClippedBases -stand_call_conf 20.0 -stand_emit_conf 20.0 --dbsnp dbSNP.vcf --artifact_detection_mode -o output.normal1.vcf
java -jar GenomeAnalysisTK.jar -T CombineVariants -R reference.fasta -V output.normal1.vcf -V output.normal2.vcf -minN 2 --setKey "null" --filteredAreUncalled --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -o MuTect2_PON.vcf

Can you comment if this is a suitable modification of the best practice in the case of RNA editing calls?

bw,
Fabian

GenotypeGVCFs with draft quality reference genome


I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome of ~367,000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly: I have a combined VCF file with just the header generated after about 18 hours. I'm not sure how much longer the process will take, as the progress meter doesn't seem to be making any sense.

Is there any known way(s) to optimize this process?

Currently using the following command:
java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf


GATK Workshops World Tour 2016


For the last few years, we have taught GATK workshops at institutions in various countries around the world. It started out as a small-scale effort with one or two workshops a year in addition to the annual workshop hosted at the Broad itself, but it seems that word has gone around because our dance card is filling up faster every year. Here's a quick recap of where we stand for the 2016 season.


The tour so far

We kicked off this year's World Tour in style -- on the other side of the world! In two back-to-back 2-day workshops in Sydney and Melbourne, our 3-person workshop crew walked a diverse assortment of researchers and bioinformaticians through the essentials of GATK Best Practices for Variant Discovery, in theory on Day 1 and in practice on Day 2. Lots of great questions were asked, most were answered, and a lot of coffee was consumed. Our thanks to Bioplatforms Australia for setting up these workshops at UNSW in Sydney and at the University of Melbourne, and taking great care of us (so much good coffee!). We look forward to hearing from our Aussie users on the forum.


Upcoming locations/dates

We have more workshops coming up in the US and in Europe, and registration is open for some of them! Here's the list up to June:

  • March 2-4: Los Angeles, USA (UCLA) -- register here
  • April 14-15: Edinburgh, UK (University of Edinburgh) -- register here
  • April 18-19: Oxford, UK (University of Oxford)
  • June 13-14: Cambridge, UK (University of Cambridge) -- register here
  • June 16-17: Helsinki, Finland (CSC-IT Center for Science)

Remaining registration links will be added when they become available.

We are looking into possible workshop dates for later in the year (September - December).


What if you can't attend?

If you can't make it to one of our workshops in person, check out the Presentations section of the documentation guide. There you can watch the workshop videos from the BroadE workshops, as well as download the workshop slides and tutorial materials from all workshops that we have taught so far. Some of the most recent haven't been posted yet but we'll get that done soon. In the meantime if you're in a hurry you can find those materials in previous posts; look for the "workshops" or "presentations" tags.

URL to a guide is deprecated

MarkDuplicates: "Exception in thread "main" htsjdk.samtools.SAMException"


Hi,

When trying to mark duplicates with MarkDuplicates, I constantly get an error (I copied the entire output below my message), while MarkDuplicatesWithMateCigar seems to work just fine. I tried to modify TMP_DIR a few times already (I tried the same directory as the input files, their parent directory, another partition), with no more success. What am I missing?

Ben

[Mon May 23 14:57:20 CDT 2016] picard.sam.markduplicates.MarkDuplicates INPUT=[CPBWGS_11_piped.bam] OUTPUT=CPBWGS_11_markduplicates.bam METRICS_FILE=CPBWGS_11_markduplicates_metrics.txt OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 TMP_DIR=[/data2/CPBWGS/all_fastq_files/markdup_tmp] CREATE_INDEX=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX= VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon May 23 14:57:20 CDT 2016] Executing as pelissie@denali on Linux 3.13.0-85-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_31-b13; Picard version: 2.2.4(f7bcc560e4e936f8c5782421b8196cda46fab833_1462390762) JdkDeflater
INFO 2016-05-23 14:57:20 MarkDuplicates Start of doWork freeMemory: 2013246576; totalMemory: 2025848832; maxMemory: 30542397440
INFO 2016-05-23 14:57:20 MarkDuplicates Reading input file and constructing read end information.
INFO 2016-05-23 14:57:20 MarkDuplicates Will retain up to 117470759 data points before spilling to disk.
[Mon May 23 14:57:26 CDT 2016] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=4487905280
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: /data2/CPBWGS/all_fastq_files/markdup_tmp/CSPI.8763470411475836719.tmp/12511.tmp not found
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:63)
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49)
at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180)
at htsjdk.samtools.CoordinateSortedPairInfoMap.put(CoordinateSortedPairInfoMap.java:164)
at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.put(DiskBasedReadEndsForMarkDuplicatesMap.java:65)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:447)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:193)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.FileNotFoundException: /data2/CPBWGS/all_fastq_files/markdup_tmp/CSPI.8763470411475836719.tmp/12511.tmp (Too many open files)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:60)
... 10 more

MarkDuplicatesWithMateCigar fails at producing metrics files with some samples


Hi,

With most of my samples, MarkDuplicatesWithMateCigar works just fine. For some, however, while it runs for the same amount of time and seems to behave similarly, it fails to produce the metrics file and tells me: "Exception in thread "main" picard.PicardException: Found a samRecordWithOrdinal with sufficiently large clipping that we may have missed including it in an early duplicate marking iteration. Please increase the minimum distance to at least 252bp to ensure it is considered (was 252)." What should I do?

Ben

MarkDuplicates vs. MarkDuplicatesWithMateCigar: what is the best practice?


Hello,

I am curious to know which tool (MarkDuplicates or MarkDuplicatesWithMateCigar) people would advise for marking duplicates (I am following GATK's Best Practices for paired-end DNA reads). I get that MarkDuplicatesWithMateCigar also uses CIGAR info to mark them, but I struggle to justify the supposed added value of using this tool vs. the "regular" MarkDuplicates.

Also, selecting representative nondups based on the total mapped length of a pair (i.e., the MarkDuplicatesWithMateCigar method) rather than on the sum of base qualities of a pair (i.e., the MarkDuplicates method) seems more intuitive to me. Why would one prefer the latter over the former?

Ben

Contamination estimation while following the Best Practices workflow


Hi,

I've started following the Best Practices guide to process >80 whole-genome samples (reference-confidence model workflow). During variant calling via HaplotypeCaller, a few of the gVCFs blew up in size (>1 TB, while the rest of the gVCFs were around 75 GB). I suspect that these samples are contaminated. I'd like to:

1) Confirm that the samples that produced large gVCFs are contaminated
2) Check that the rest of the samples are unaffected (higher priority)

I've started looking at GATK-friendly ConTest [1]. It seems that it requires genotyped VCF files as input. Is it appropriate to call GenotypeGVCFs individually for each gVCF and use the produced VCF as input for ConTest? Or can I use "raw" gVCFs?

I've also considered using cleanCall [2], but it sounds like I'd need to repeat variant calling using samtools (it took ages to produce the gVCFs, so I'm trying to avoid this) or somehow do the gVCF -> VCF -> PED conversion.

Thank you!

[1] https://www.broadinstitute.org/cancer/cga/contest_run
[2] https://github.com/hyunminkang/cleanCall

Status for dealing with paired-end reads


Hi all, new here and it's my first post, so sorry if it's not relevant enough.

I have been thinking about the same problem that occurred here twice already, and I cannot seem to come to a conclusion on what to do. The problem is how to interpret paired-end reads with overlapping regions. Some "commercial" software does the merging prior to any further analysis and then reports a variant's coverage in relation to the merged reads. This makes sense to me, since I get single-molecule coverage. Here in GATK, as I've read in these posts:
http://gatkforums.broadinstitute.org/gatk/discussion/5559/haplotype-caller-and-overlapping-paired-end-reads
http://gatkforums.broadinstitute.org/gatk/discussion/1745/does-unifiedgenotyper-count-overlapping-paired-end-reads-as-two-separate-reads
It seems that the information from the overlap is modelled (to account for varying base qualities and possible differences) but the reported coverage seems inflated. When I run the same sample through the "commercial" counterpart and GATK, I consistently get higher coverages on variants.

The real problem I have with this is that when dealing with standards and accreditations, one wants to claim that variants are given for regions with coverage above 30X. But for HaplotypeCaller, what does that mean? Should it be 60X (which would be too strict in many cases and undercut a lot of data from WES projects)?

To boil it down: is there a way to bring the coverage to a value reflecting the molecules sequenced (i.e., consider the overlap Phred scores but count the overlap as coverage of 1), or should there be some read-merging step beforehand?

Sorry for a long and convoluted post, hope to hear from you!
Piotr


Sample prepared from multiple flow cells and multiple lanes


Hi,

I have multiple paired-end FASTQs from a single biological sample that was run on three flow cells, two lanes each.
In short, I have

sample_flowcell1_lane1.R1.fastq.gz sample_flowcell1_lane1.R2.fastq.gz
sample_flowcell1_lane2.R1.fastq.gz sample_flowcell1_lane2.R2.fastq.gz
sample_flowcell2_lane1.R1.fastq.gz sample_flowcell2_lane1.R2.fastq.gz
sample_flowcell2_lane2.R1.fastq.gz sample_flowcell2_lane2.R2.fastq.gz
sample_flowcell3_lane1.R1.fastq.gz sample_flowcell3_lane1.R2.fastq.gz
sample_flowcell3_lane2.R1.fastq.gz sample_flowcell3_lane2.R2.fastq.gz

where R1, R2 are paired-end reads.

I'm trying to generate a single BAM file from these FASTQs with bwa mem and samtools on reference GRCh37,
then ultimately run whole-exome analysis with the following procedure:
bwa mem for each of the 6 sets of paired-end reads
samtools sort for each of the 6 generated BAMs
samtools merge -r on the 6 sorted BAMs to produce a single BAM

Then start the GATK process on the merged.bam
picard.jar AddOrReplaceReadGroups
picard.jar MarkDuplicates
picard.jar ReorderSam
GATK RealignerTargetCreator
GATK IndelRealigner
GATK Baserecalibrator
and so on ...

I am not sure what the best way is to merge these FASTQs and generate a single BAM.
Could you recommend how I should generate a single BAM from these FASTQs?

GATK best practices somatic mutations


Hi, I would like to know whether it is possible to perform the analysis of somatic mutations with RNA-seq using STAR and MuTect, since the current Best Practices do not cover this.
Thank you.

Riccardo

GATK Best Practices pipeline written in Scala for Queue


Hello,

I tried GATK HaplotypeCaller, but it is too slow on my WGS data. I have been reading the GATK forums and am trying out Queue to make the pipeline faster. I have gotten as far as the HelloWorld.scala and CountingReads.scala, and reading about traits. I have used Java in the past, so I have an understanding of the basics here.

I see several threads about Queue, but am not sure which is the latest and most comprehensive. So decided to post the question here.

1) I would like to implement the following in Queue, and would like to know if there are any Scala scripts already present for this workflow:

  • Sort Bam
  • Mark duplicates
  • Add or Create Groups
  • Reorder Bam
  • Index Bam
  • GATK HaplotypeCaller
  • GATK VariantFiltration

2) Until now I have been using a shell script that loops my workflow (with the above steps) over every BAM file that I have. When using Queue, how should I implement scatter-gather here? I did not see any documentation for that. How can I get the QScript to run on my BAM files in parallel?
PS: I do not have a cloud or a cluster to work on. I am trying this out on my local Ubuntu machine.

Post-transcriptional modifications in RNA-seq base recalibration Best Practices?


Are the authors of GATK concerned about Post-transcriptional Modifications to mRNA negatively impacting Base Recalibration and Variant Calling via GATK RNA-seq Best practices?

Reverse transcriptases have difficulty correctly reading modified nucleotides. The reverse transcriptase may produce an error at a modified nucleotide when making the cDNA. Illumina will then read the resulting cDNA correctly (and give a high quality score). Thus, even though the Illumina reads correctly report the base in the cDNA (with high quality scores), it will be "wrong" compared to the reference, and not masked by dbSNP since it is only a post-transcriptional modification. This will severely reduce the resulting empirical quality scores calculated by BaseRecalibrator.

For DNA-seq, BaseRecalibrator masks "--knownSites" of polymorphism when calculating empirical Quality scores. The "--knownSites" is usually a VCF from e.g. dbSNP.

In RNA-seq, do the authors of GATK recommend any kind of VCF with known Post-transcriptional modifications in mRNA?

How does scatter-gather in WDL handle memory limits


Specifically in an LSF usage scenario: say I have 10 samples that I pipe into GATK via WDL's scatter-gather pipeline, with a resource limit such that 10 samples cannot run at the same time. Would WDL try to force the 10 jobs through and therefore get killed by the job scheduler, or does it run a subset and keep within the memory limit?
