The Picard Pipeline
Sequencing Pipeline Informatics at Broad
What is Picard?
Picard in context
What Picard produces and how to access it
Metrics, Metrics, Metrics
A set of tools for processing and analyzing next generation sequencing data
Many of which are released publicly
A set of pipelines that process all Illumina sequence data generated at Broad
Extract Illumina data to standard format
Align reads with BWA or MAQ
Mark duplicate reads
Re-align reads around known indels
Calibrate quality scores
Collect metrics about libraries and run
Verify sample identity
Run triage
Adapter trimming/marking happens during data extraction from Illumina (this information is used during alignment)
Indexed runs are de-multiplexed during extraction, and each index/sample is processed independently
Recalibration is only performed for references that have dbSNP data
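For orientation only, here is a minimal Python sketch of the alignment, duplicate-marking, and lane-level metrics steps driven as external commands; the extraction, indel-realignment, recalibration, fingerprinting, and triage steps are omitted. File names, the reference path, and the jar locations are hypothetical, and this is not the actual pipeline code.

    import subprocess

    def run(cmd):
        # Run one pipeline step; fail loudly if it exits non-zero.
        subprocess.run(cmd, check=True)

    # Hypothetical inputs: one paired-end lane already extracted to FASTQ.
    ref = "reference.fasta"
    fq1, fq2 = "lane.1.fastq", "lane.2.fastq"

    # Align reads with BWA (aln/sampe for paired-end data).
    run(["bwa", "aln", "-f", "lane.1.sai", ref, fq1])
    run(["bwa", "aln", "-f", "lane.2.sai", ref, fq2])
    with open("lane.aligned.sam", "w") as sam:
        subprocess.run(["bwa", "sampe", ref, "lane.1.sai", "lane.2.sai", fq1, fq2],
                       stdout=sam, check=True)

    # Coordinate-sort, then mark (not remove) duplicate reads with Picard.
    run(["java", "-jar", "SortSam.jar", "INPUT=lane.aligned.sam",
         "OUTPUT=lane.sorted.bam", "SORT_ORDER=coordinate"])
    run(["java", "-jar", "MarkDuplicates.jar", "INPUT=lane.sorted.bam",
         "OUTPUT=lane.aligned.duplicates_marked.bam",
         "METRICS_FILE=lane.duplication_metrics"])

    # Indel realignment and quality-score recalibration would follow here,
    # then lane-level metrics collection, for example:
    run(["java", "-jar", "CollectAlignmentSummaryMetrics.jar",
         "INPUT=lane.aligned.duplicates_marked.bam",
         "OUTPUT=lane.alignment_summary_metrics",
         "REFERENCE_SEQUENCE=" + ref])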
Aggregation Pipeline
Merge all data for each library
Mark duplicate reads per library
Collect metrics per library
Merge all libraries for a sample
Collect metrics about the sample
A single BAM file is created per sample (within the context of a project)
Aggregations are started after data is processed or re-processed through the run-level pipeline (after a 12 hour “quiet period”)
Outdated aggregations are kept for 2 weeks after newer aggregations are completed
Input comes from the Run Level Pipeline; aggregated BAMs go to downstream pipelines and analysts
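A minimal per-sample sketch of the aggregation steps, assuming each library's run-level BAMs are already on disk; the library names, file names, reference path, and jar locations below are hypothetical, not the production pipeline.

    import subprocess

    def picard(tool, *options):
        # Invoke one Picard tool jar (jar names/paths are placeholders).
        subprocess.run(["java", "-jar", tool + ".jar", *options], check=True)

    # Hypothetical run-level BAMs for one sample, grouped by library.
    libraries = {
        "libA": ["run1.lane1.libA.bam", "run2.lane3.libA.bam"],
        "libB": ["run1.lane2.libB.bam"],
    }

    per_library_bams = []
    for lib, run_bams in libraries.items():
        # Merge all run-level data for the library, then mark duplicates per library.
        picard("MergeSamFiles",
               *["INPUT=" + bam for bam in run_bams],
               "OUTPUT=" + lib + ".merged.bam")
        picard("MarkDuplicates",
               "INPUT=" + lib + ".merged.bam",
               "OUTPUT=" + lib + ".duplicates_marked.bam",
               "METRICS_FILE=" + lib + ".duplication_metrics")
        per_library_bams.append(lib + ".duplicates_marked.bam")

    # Merge all libraries into the single per-sample BAM, then collect sample-level metrics.
    picard("MergeSamFiles",
           *["INPUT=" + bam for bam in per_library_bams],
           "OUTPUT=sample.bam")
    picard("CollectAlignmentSummaryMetrics",
           "INPUT=sample.bam",
           "OUTPUT=sample.alignment_summary_metrics",
           "REFERENCE_SEQUENCE=reference.fasta")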
What                    Where
Pipeline Outputs        /seq/picard/{flowcell}
Aggregation Outputs     /seq/picard_aggregation/{project}/{sample}
Picard Binaries         /seq/software/picard/current/bin
Metrics Documentation   http://iwww/~picard/picard_metric_definitions.html
Source Code             https://svn.broadinstitute.org/picard/trunk
                        https://picard.svn.sourceforge.net/svnroot/picard/trunk
And coming soon: BASS
Programmatic access to BAM files in BASS is available
A web page for accessing BAM files in BASS is under construction
All primary data is delivered in BAM format, which includes basecalls (the reads), quality scores, alignment data, etc.
BAM files processed through Picard always contain all reads, including:
All unaligned reads (marked as unmapped)
All duplicate reads (marked as duplicates)
All “non-PF” reads (marked as failing vendor quality)
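Since nothing is dropped, downstream consumers filter on these flags rather than assuming reads are missing. A minimal sketch using pysam (one common Python BAM reader; the file name is hypothetical):

    import pysam

    counts = {"total": 0, "unmapped": 0, "duplicate": 0, "non_pf": 0}

    # Hypothetical path to a Picard-produced BAM; all reads are present in the file.
    with pysam.AlignmentFile("sample.bam", "rb") as bam:
        for read in bam.fetch(until_eof=True):  # until_eof also yields unmapped reads
            counts["total"] += 1
            if read.is_unmapped:
                counts["unmapped"] += 1
            if read.is_duplicate:
                counts["duplicate"] += 1
            if read.is_qcfail:  # "non-PF": marked as failing vendor quality
                counts["non_pf"] += 1

    print(counts)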
The pipeline generates tons of metrics!
And we have tools to generate even more metrics than are run in the pipeline
Please check before re-inventing the wheel
Internal Control Metrics
Quality Calibration Data
Alignment Summary Metrics
GC Bias Metrics
Quality By Cycle
Quality Distribution
Duplication Metrics
Insert Size Metrics
Low Pass Concordance
Hybrid Selection Metrics
SNP Fingerprint
Jumping Library Metrics
dbSNP Concordance
Quality/Yield Metrics
Barcode Metrics
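The metric definitions are documented at the link above, and the metrics files share one simple tab-delimited layout, so they are easy to load programmatically. A small parser sketch, assuming the standard "## METRICS CLASS" layout; the example file name and column names are illustrative:

    def read_picard_metrics(path):
        # Parse the tab-delimited metrics table from a Picard metrics file:
        # '#'-prefixed header lines, a '## METRICS CLASS' marker, a column-header
        # row, then value rows terminated by a blank line (a histogram may follow).
        with open(path) as handle:
            lines = [line.rstrip("\n") for line in handle]

        # The line right after the METRICS CLASS marker is the column header.
        start = next(i for i, line in enumerate(lines)
                     if line.startswith("## METRICS CLASS")) + 1
        header = lines[start].split("\t")

        rows = []
        for line in lines[start + 1:]:
            if not line.strip():  # a blank line ends the metrics table
                break
            rows.append(dict(zip(header, line.split("\t"))))
        return rows

    # Example with a duplication-metrics file (file and columns illustrative):
    # for row in read_picard_metrics("lane.duplication_metrics"):
    #     print(row["LIBRARY"], row["PERCENT_DUPLICATION"])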
Example metric plots:
GC Bias Plot (20AL7ABXX.1; total clusters 124,301,048, aligned reads 206,073,969): normalized coverage, fraction of windows, and mean base quality by GC% of 100-base windows
Insert Size Histogram (20AL7ABXX.1.aligned.duplicates_marked.bam): read-pair count by insert size, FR orientation
Quality By Cycle (209KJABXX.1.aligned.duplicates_marked.bam): mean quality and mean original quality by cycle
Quality Score Distribution (209KJABXX.1.aligned.duplicates_marked.bam): observations per quality score, for quality scores and original quality scores
IC Error Rate by Cycle (209KJABXX.1.unmapped.bam, total n=481,735): internal-control error rate (log axis) by cycle for read 1 and read 2
Integrate GATK Unified Genotyper in single-sample mode
Customized pipeline for cDNA/RNA sequencing
Yet more sample identity/validity checking