Assembling the Genome in a Bottle sequencing data with ABySS

Shaun D Jackman

2016-06-06

Genome in a Bottle

Creative Commons Attribution License

Shaun Jackman

BC Cancer Agency Genome Sciences Centre
Vancouver, Canada
@sjackman | github.com/sjackman | sjackman.ca

Genome in a Bottle Data

GIAB Data

  • 7 individuals
    • Pilot individual (NA12878)
    • Ashkenazim Trio
    • Chinese Trio
  • 13 sequencing technologies
    • Illumina
    • Ion Torrent
    • 10x Genomics
    • PacBio
    • Oxford Nanopore
    • SOLiD
    • Complete Genomics
    • BioNano Genomics

Sequencing

  • Illumina 2x150 paired-end of 600 bp
  • Illumina 2x250 paired-end of 600 bp
  • Illumina 6 kbp mate-pair
  • Illumina Whole Exome
  • Ion Proton Exome
  • SOLiD
  • Moleculo
  • 10x Genomics GemCode
  • PacBio
  • Oxford Nanopore
  • Complete Genomics
  • Complete Genomics LFR
  • BioNano Genomics Irys

Assembly Pipeline

  • Inspect data
  • Trim adapters
  • Correct errors
  • Merge reads
  • Estimate k-mer abundance
  • Assemble
  • Scaffold
  • Close gaps
  • Polish
  • Align to reference
  • Calculate metrics
  • Count core genes
  • Generate reports

Inspect data

  • Check library and sequencing quality
  • Discard bad lanes
  • Trim bad cycles

Trim adapters

Remove adapter sequence from reads
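
A minimal sketch of 3' adapter trimming, assuming the adapter sequence is known and matching it exactly. The function name `trim_adapter` and the `min_overlap` parameter are illustrative, and the adapter used in the example is the common Illumina adapter prefix. Real trimmers also tolerate mismatches and use base qualities, which this sketch does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the whole adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends partway through the adapter, so only a
    # prefix of the adapter is present at the 3' end of the read.
    longest = min(len(adapter) - 1, len(read))
    for n in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:n]):
            return read[:-n]
    # No adapter found: return the read unchanged.
    return read


ADAPTER = "AGATCGGAAGAGC"  # example adapter prefix, an assumption of this sketch
print(trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER))  # ACGTACGT
print(trim_adapter("ACGTACGTAGATC", ADAPTER))            # ACGTACGT
print(trim_adapter("ACGTACGT", ADAPTER))                 # ACGTACGT
```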

Correct errors

Reduce memory utilization and improve contiguity and correctness

Merge reads

  • Merge overlapping reads
  • Fill the gap between paired-end reads
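
A minimal sketch of the first case above, merging a pair whose reads overlap. It assumes error-free reads and an exact overlap; `merge_pair` and `min_overlap` are hypothetical names. Filling the gap between non-overlapping pairs needs a graph of the other reads and is not shown.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1: str, read2: str, min_overlap: int = 10) -> Optional[str]:
    """Merge a read pair whose 3' ends overlap exactly, else return None."""
    # Read 2 comes from the opposite strand, so flip it onto read 1's strand.
    mate = reverse_complement(read2)
    # Try the longest possible overlap first.
    for n in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1.endswith(mate[:n]):
            return read1 + mate[n:]
    return None


# A 30 bp fragment sequenced as two 20 bp reads that overlap by 10 bp.
fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
read1 = fragment[:20]
read2 = reverse_complement(fragment[10:])
print(merge_pair(read1, read2) == fragment)  # True
```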

Estimate k-mer abundance

  • Estimate memory usage prior to assembly
  • Pick a range of values for k to assemble
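
A minimal sketch of a k-mer abundance histogram, counting exactly in memory. The number of distinct k-mers is a rough proxy for the size of the assembly graph, and the histogram shows how many k-mers are rare (likely errors). `kmer_spectrum` is a hypothetical name, and reverse complements are not collapsed here. Exact counting like this is only practical for small inputs.

```python
from collections import Counter


def kmer_spectrum(reads, k):
    """Return (number of distinct k-mers, abundance histogram)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # histogram[c] = number of distinct k-mers seen exactly c times
    histogram = Counter(counts.values())
    return len(counts), dict(histogram)


reads = ["ACGTAC", "CGTACG"]
for k in (3, 4):
    distinct, histogram = kmer_spectrum(reads, k)
    print(k, distinct, histogram)
```

Running this for a range of k shows how the number of distinct k-mers changes with k.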

Assemble

Assemble reads into contigs
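
ABySS is a de Bruijn graph assembler. The sketch below illustrates the general idea only, not ABySS's implementation: it builds a graph of k-mers and reports each unbranched path as a contig. It assumes error-free reads, ignores reverse complements, and skips isolated cycles; `assemble_unitigs` is a hypothetical name.

```python
from collections import defaultdict


def assemble_unitigs(reads, k):
    """Return the unbranched paths (unitigs) of a simple de Bruijn graph."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    # Two k-mers are linked when they overlap by k - 1 bases.
    succ, pred = defaultdict(list), defaultdict(list)
    for kmer in kmers:
        for base in "ACGT":
            nxt = kmer[1:] + base
            if nxt in kmers:
                succ[kmer].append(nxt)
                pred[nxt].append(kmer)

    def starts_unitig(kmer):
        # A k-mer starts a path unless it has exactly one predecessor
        # and that predecessor has exactly one successor.
        p = pred[kmer]
        return len(p) != 1 or len(succ[p[0]]) != 1

    contigs = []
    for kmer in sorted(kmers):
        if not starts_unitig(kmer):
            continue
        contig, cur = kmer, kmer
        # Extend while the path is unambiguous in both directions.
        while len(succ[cur]) == 1 and len(pred[succ[cur][0]]) == 1:
            cur = succ[cur][0]
            contig += cur[-1]
        contigs.append(contig)
    return contigs


print(assemble_unitigs(["ACGTTGCA", "TTGCATGG"], k=4))  # ['ACGTTGCATGG']
```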

Scaffold

Join contigs into scaffolds

Close gaps

  • Fill gaps of Ns with sequence
  • Improve the contig N50
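
A small sketch of why closing gaps improves the contig N50: contigs are the pieces of a scaffold between runs of N, so replacing a run of N with sequence joins two contigs into one. The function names and the toy sequences are illustrative only.

```python
import re


def contig_lengths(scaffolds):
    """Split scaffolds at runs of N and return the contig lengths."""
    return [len(c) for s in scaffolds for c in re.split(r"N+", s.upper()) if c]


def n50(lengths):
    """Largest L such that pieces of length >= L cover half the total."""
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length
    return 0


before = ["A" * 40 + "N" * 10 + "C" * 30, "G" * 20]  # one 10 bp gap
after = ["A" * 40 + "T" * 10 + "C" * 30, "G" * 20]   # the gap is filled
print(n50(contig_lengths(before)))  # 30
print(n50(contig_lengths(after)))   # 80
```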

Polish

Identify and remove errors from the assembly

Align to reference

Identify misassemblies

Calculate metrics

Measure contiguity, correctness, and completeness
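
A minimal sketch of the contiguity metric reported later in this talk. NG50 is computed like N50, but the target is half of the genome size rather than half of the assembly size. NGA50 applies the same calculation to aligned blocks, after breaking the sequences at misassemblies. The function name `ngx` and the toy lengths are assumptions for illustration.

```python
def ngx(lengths, genome_size, x=50):
    """Largest L such that sequences of length >= L cover x% of genome_size."""
    target = genome_size * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0  # the assembly is too small to reach the target


lengths = [80, 70, 50, 40, 30, 20]      # toy scaffold lengths, total 290
print(ngx(lengths, sum(lengths)))       # N50  = 70
print(ngx(lengths, genome_size=400))    # NG50 = 50
```

Using the genome size as the denominator makes assemblies of different total size comparable.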

Count core genes

Estimate assembly completeness without a reference

Generate reports

Generate tables and figures of contiguity, correctness, and completeness

Results

Correct errors using BFC

  • Reduce memory utilization from 975 GB to 418 GB
  • Improve scaffold NG50 from 3.7 Mbp to 4.9 Mbp
  • Improve correctness? (no data)

Contigs

Assemble contigs using ABySS

Scaffolds

Scaffold using ABySS

Contiguity vs correctness

Scaffold NGA50 vs breakpoints

fin
