Efficient Assembly of Large Genomes

Shaun Jackman

Computational Biology, 10x Genomics
Vancouver, Canada
@sjackman · github.com/sjackman · sjackman.ca

Efficient Assembly
of Large Genomes

Introduction
ABySS 2.0
Tigmint
UniqTag
ORCA
Organellar genomes of white spruce
Mitochondrial genome of Sitka spruce
Genome assembly of western redcedar
Conclusion

Tigmint
BMC Bioinformatics
2018 doi.org/cwfh

ABySS 2.0
Genome Research
2017 doi.org/f9x8qp

ORCA
Bioinformatics
2019 doi.org/c4mw

Sitka Spruce Mitochondrion
bioRxiv
2019 doi.org/c4mv

White Spruce Organelles
Genome Biology and Evolution
2016 doi.org/f8bxck

UniqTag
PLOS ONE
2015 doi.org/c3m3

Publications

Five first-author (or joint) papers
One paper each year from 2015 through 2019
Collaborated on 32 papers since 2009
29 papers with at least 10 citations
ABySS has been cited over 2,900 times!

Short Read Genome Assembly

ABySS 1.0 (2009) was the first to assemble
a human genome from short reads (42 bp!)

de Bruijn graph assembler
Stored k-mers in a hash table
Distributed the hash table over many machines
Used MPI to aggregate sufficient memory
Assembles large genomes

Challenges

Uses lots of memory
Network communication is super slow
Message passing is also slow

Solution

A memory-efficient data structure
reduces memory usage
Fitting entire graph in a single machine
eliminates network communication
Using shared memory (OpenMP)
eliminates message passing (MPI)

ABySS 2.0 reduces the memory
usage of ABySS tenfold.

Memory efficient de Bruijn graph using a Bloom filter
Memory usage is independent of k
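The idea can be illustrated with a minimal Bloom filter over k-mers. This is a sketch under stated assumptions, not the ABySS 2.0 implementation: the class name, the salted BLAKE2b hashing, and the parameters num_bits and num_hashes are all hypothetical, and reverse-complement (canonical) k-mers are ignored. It shows that the bit array has a fixed size chosen up front, so memory does not grow with k.

```python
import hashlib


class KmerBloomFilter:
    """Illustrative Bloom filter for k-mers (hypothetical, not ABySS's code)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # The bit array size depends only on num_bits, never on k.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May return True for a k-mer never added (a false positive),
        # but never returns False for a k-mer that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(sequence, k):
    """Yield every k-mer of a sequence, left to right."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

A filter built with the same num_bits occupies the same memory whether it holds 5-mers or 100-mers; only the false positive rate changes with the number of distinct k-mers inserted.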

Navigating a Bloom filter de Bruijn graph
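Because the graph stores only k-mer membership and no explicit edges, neighbours are found by querying the four possible extensions of a k-mer. The sketch below is an illustration only: the function names are hypothetical, and a plain Python set stands in for the Bloom filter, since any container supporting `in` works.

```python
def successors(kmer, kmer_set):
    """Return the k-mers that may follow `kmer`: drop the first base, try each next base."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmer_set]


def predecessors(kmer, kmer_set):
    """Return the k-mers that may precede `kmer`: drop the last base, try each previous base."""
    prefix = kmer[:-1]
    return [base + prefix for base in "ACGT" if base + prefix in kmer_set]
```

With a real Bloom filter, some of the returned neighbours may be false positives, which appear as spurious short branches in the graph.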

Sequencing errors and Bloom filter false positives

Spruce genome assemblies

ABySS	1.3.5	2.0.0
Spruce species	Interior	Sitka
Machines	115	1
RAM (GB)	4,300	500
CPU cores	1,380	64
CPU time*	6.0 years	3.2 years

* Time of unitig assembly without scaffolding

Human: 42 Mbp NG50 with linked reads and BioNano

ABySS 2.0 Conclusions

ABySS 2.0 reduces memory usage 10-fold
from 418 GB to 34 GB for human
from 4,300 GB to 500 GB for spruce
High-throughput short-read sequencing
combined with large molecule scaffolding
such as linked reads and optical mapping
permits cost-effective assembly of large genomes
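As a quick check of the "10-fold" figure, the memory numbers quoted above can be divided directly. The dictionary layout is just for illustration. Note that the spruce comparison is between two different species (Interior and Sitka), per the table of spruce assemblies.

```python
# Memory usage in GB: (before, after), taken from the figures quoted above.
memory_gb = {
    "human": (418, 34),
    "spruce": (4300, 500),
}

# Fold reduction is simply before / after.
fold_reduction = {name: before / after for name, (before, after) in memory_gb.items()}

for name, fold in fold_reduction.items():
    print(f"{name}: {fold:.1f}-fold reduction")
```

The two ratios fall on either side of ten, which is where the rounded "10-fold" summary comes from.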

Linked Reads

Contigs and scaffolds
come to an end due to…

repeats
sequencing gaps
structural variation
misassemblies

Correct misassemblies

Scaffold

https://github.com/JustinChu/JupiterPlot

Human genome assembly (GIAB HG004 NA24143)

Assembly Tools	NGA50
ABySS 2.0	3 Mbp
ABySS 2.0 + ARCS	8 Mbp
ABySS 2.0 + Tigmint + ARCS	16 Mbp

Tigmint reduced misassemblies by 216 (27% reduction)

Corrects and improves long read assemblies too!

Sequencing	Nanopore	PacBio
Assembler	Canu	Falcon
NGA50 before	5.4 Mbp	4.2 Mbp
NGA50 after	10.9 Mbp	12.0 Mbp
Improvement	2.0-fold	2.9-fold

Tigmint Conclusions

Scaffolding after correcting with Tigmint yields an assembly both more correct and more contiguous

Linked reads permit cost-effective assembly of large genomes using high-throughput sequencing

Western redcedar (Thuja plicata)

Western redcedar (Thuja plicata) Range

Western Redcedar Methods

Conifer Assemblies

Year	Species	Scaffold N50
2018	Western redcedar	2,310 kbp
2017	Sugar pine²	2,510 kbp
2017	Douglas fir	341 kbp
2017	Loblolly pine²	108 kbp
2016	Sugar pine¹	247 kbp
2015	Interior white spruce²	83 kbp
2015	White spruce	20 kbp
2014	Loblolly pine¹	67 kbp
2013	Interior white spruce¹	20 kbp
2013	Norway spruce	5 kbp

¹initial assembly ²improved assembly
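For readers unfamiliar with the metric in the table, here is a minimal sketch of how N50 is conventionally computed, with NG50 as the variant that uses an estimated genome size instead of the assembly size. The function name and signature are hypothetical. NGA50, used elsewhere in this talk, applies the same calculation to aligned blocks after breaking sequences at misassemblies, which this sketch does not do.

```python
def n50(lengths, genome_size=None):
    """Return N50 of the sequence lengths, or NG50 if genome_size is given.

    N50 is the length L such that sequences of length >= L
    cover at least half of the total (or of the genome size).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # The sequences cover less than half of genome_size.
```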

Efficient Assembly
of Large Genomes

Introduction
ABySS 2.0 (doi.org/f9x8qp)
Tigmint (doi.org/cwfh)
UniqTag (doi.org/c3m3)
ORCA (doi.org/c4mw)
Organellar genomes of white spruce (doi.org/f8bxck)
Mitochondrial genome of Sitka spruce (doi.org/c4mv)
Genome assembly of western redcedar
Conclusion

Think of each molecule of linked reads as a long read.

Can we assemble these molecules using
an overlap-layout-consensus approach
without first assembling the reads?

Physical Map of Linked Read Molecules

Overlap Layout Consensus

Each barcode of linked reads is a bag of k-mers
Keep only the minimizers of each read for efficiency
Reduce a hundred k-mers per read to five minimizers
Discard most frequent minimizers, likely repetitive
Count shared minimizers per pair of barcodes
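The first steps of this list can be sketched as follows. This is an illustration under assumptions, not Physlr's implementation: the hash function (BLAKE2b), the window scheme (smallest-hash k-mer in each window of w consecutive k-mers), and the names k, w, and max_barcodes are all choices made here for the example.

```python
import hashlib
from collections import Counter


def hash_kmer(kmer):
    """Hash a k-mer to an integer so that minimizer choice is pseudo-random."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "little")


def minimizers(read, k, w):
    """Return the set of minimizers: the smallest-hash k-mer of each window of w k-mers."""
    kms = [read[i:i + k] for i in range(len(read) - k + 1)]
    if not kms:
        return set()
    if len(kms) < w:
        return {min(kms, key=hash_kmer)}
    return {min(kms[i:i + w], key=hash_kmer) for i in range(len(kms) - w + 1)}


def barcode_minimizers(reads_by_barcode, k, w):
    """Treat each barcode as a bag: the union of the minimizers of its reads."""
    return {
        barcode: set().union(*(minimizers(read, k, w) for read in reads))
        for barcode, reads in reads_by_barcode.items()
    }


def discard_frequent(minimizers_by_barcode, max_barcodes):
    """Drop minimizers seen in more than max_barcodes barcodes (likely repeats)."""
    counts = Counter(m for ms in minimizers_by_barcode.values() for m in ms)
    return {
        barcode: {m for m in ms if counts[m] <= max_barcodes}
        for barcode, ms in minimizers_by_barcode.items()
    }
```

Larger windows keep fewer minimizers per read, trading sensitivity for speed and memory.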

Barcode Overlap Graph

Each barcode is a vertex
Each edge is the overlap between two barcodes
Edge weight is number of shared minimizers
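A minimal sketch of building such a graph, assuming each barcode has already been reduced to a set of minimizers. The function name and the min_shared threshold are hypothetical. An inverted index from minimizer to barcodes avoids comparing every pair of barcodes directly.

```python
from collections import Counter, defaultdict
from itertools import combinations


def barcode_overlap_graph(minimizers_by_barcode, min_shared=1):
    """Return {(u, v): weight}, where weight is the number of shared minimizers."""
    # Inverted index: which barcodes contain each minimizer.
    barcodes_by_minimizer = defaultdict(set)
    for barcode, ms in minimizers_by_barcode.items():
        for m in ms:
            barcodes_by_minimizer[m].add(barcode)

    # Every pair of barcodes sharing a minimizer gains one unit of edge weight.
    weights = Counter()
    for barcodes in barcodes_by_minimizer.values():
        for u, v in combinations(sorted(barcodes), 2):
            weights[u, v] += 1

    return {edge: n for edge, n in weights.items() if n >= min_shared}
```

The pair enumeration is quadratic in the number of barcodes per minimizer, which is one reason to discard the most frequent minimizers first.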

Physlr contig of the Sitka spruce plastid (120 kbp)

Separate Molecules

We have the barcode overlap graph
but we want the molecule overlap graph
Separate each barcode into its component molecules
Look at the neighborhood graph of each barcode
(vertex-induced subgraph of its immediate neighbors)
Each community is one molecule
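The separation step can be sketched as below. Two assumptions are made loudly here: the barcode itself is left out of its neighbourhood graph (otherwise every neighbour would be connected through it), and connected components stand in for community detection, which is a simpler technique than whatever community algorithm the real tool uses. Function names are hypothetical.

```python
from collections import defaultdict


def adjacency(edges):
    """Build an undirected adjacency map from an iterable of (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def molecules_of(barcode, adj):
    """Split a barcode's neighbours into groups, one group per putative molecule.

    Connected components of the neighbourhood graph are used here as a
    stand-in for community detection.
    """
    unvisited = set(adj.get(barcode, ()))
    communities = []
    while unvisited:
        start = unvisited.pop()
        component, stack = {start}, [start]
        while stack:
            u = stack.pop()
            # Only follow edges that stay within the neighbourhood.
            for v in adj[u] & unvisited:
                unvisited.discard(v)
                component.add(v)
                stack.append(v)
        communities.append(component)
    return communities
```

Connected components fail when a single spurious edge joins two molecules, which is why a community detection method that tolerates sparse cross links is preferable in practice.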

Neighborhood graph of one barcode with two molecules

Overlap Layout Consensus

A layout is a linear ordering of molecules
Find a path through the molecule overlap graph
Solve the traveling salesman problem
Optimal solution is NP-hard
Approximate solution is good enough
Start with a maximum spanning tree (MST)

Maximum spanning tree of fruit fly chr4 (1.35 Mbp)

Maximum Spanning Tree (MST)

Compute the maximum spanning tree
Prune short branches of the MST
Assemble contigs from simple non-branching paths
Inspired by MSTmap used for genetic linkage maps
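These steps can be sketched with Kruskal's algorithm run on edges sorted by decreasing weight. As a simplification, the sketch keeps only the longest path of the tree (found with two breadth-first searches) instead of pruning short branches and emitting every non-branching path. It also assumes the graph is connected. All function names are hypothetical.

```python
from collections import defaultdict, deque


def maximum_spanning_tree(weighted_edges):
    """Kruskal's algorithm on {(u, v): weight}, taking heaviest edges first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # Path halving.
            x = parent[x]
        return x

    tree = []
    for (u, v), w in sorted(weighted_edges.items(), key=lambda item: item[1], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # The edge joins two components, so keep it.
            parent[root_u] = root_v
            tree.append((u, v, w))
    return tree


def farthest_path(adj, start):
    """Breadth-first search; return the path from the farthest vertex back to start."""
    previous = {start: None}
    queue = deque([start])
    last = start
    while queue:
        last = queue.popleft()
        for v in adj[last]:
            if v not in previous:
                previous[v] = last
                queue.append(v)
    path = []
    while last is not None:
        path.append(last)
        last = previous[last]
    return path


def backbone(tree):
    """Longest path (in edges) of a tree: search from any vertex, then from the far end."""
    adj = defaultdict(list)
    for u, v, _ in tree:
        adj[u].append(v)
        adj[v].append(u)
    end = farthest_path(adj, tree[0][0])[0]
    return farthest_path(adj, end)
```

The backbone gives one linear ordering of molecules; the vertices left off it correspond to the short branches that would be pruned.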

MSTmap: Efficient and Accurate Construction of Genetic Linkage Maps from the Minimum Spanning Tree of a Graph
Wu et al. (2008) doi.org/d4sqs8

Physlr physical map of fruit fly (138 Mbp)

12.7 Mbp NG50, 25 chromosomes in 144 contigs
4.8 Mbp NG50 for Supernova

40.9 Mbp NG50, 23 chromosomes in 95 contigs
38.5 Mbp NG50 for Supernova

Scaling Up to Larger Genomes

Western redcedar (12 Gbp)

Sitka spruce (20 Gbp)

Overlap Layout Consensus

Scaffold by mapping contigs to the physical map
Targeted assembly of a chromosome, or a smaller region
Assemble the complete genome using multiple targeted assemblies

Photo by Martin Krzywinski

fin

Supplemental Slides

Physlr Run Time

Physlr Memory Usage

Physlr contig of fruit fly chr4 (1.35 Mbp)

First-author Publications

Largest Complete Mitochondrial Genome of a Gymnosperm, Sitka Spruce (Picea sitchensis), Indicates Complex Physical Structure
SD Jackman, L Coombe, RL Warren, H Kirk, E Trinh, T McLeod, S Pleasance, P Pandoh, Y Zhao, RJ Coope, J Bousquet, J Bohlmann, SJM Jones, I Birol
bioRxiv 2019
ORCA: A Comprehensive Bioinformatics Container Environment for Education and Research
SD Jackman, T Mozgacheva, S Chen, B O’Huiginn, L Bailey, I Birol, SJM Jones
Bioinformatics 2019
Tigmint: correcting assembly errors using linked reads from large molecules
SD Jackman, L Coombe, J Chu, RL Warren, BP Vandervalk, …
BMC Bioinformatics 2018
ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter
SD Jackman*, BP Vandervalk*, H Mohamadi, J Chu, S Yeo, SA Hammond, …
Genome Research 2017
Organellar genomes of white spruce (Picea glauca): assembly and annotation
SD Jackman, RL Warren, EA Gibb, BP Vandervalk, H Mohamadi, J Chu, …
Genome Biology and Evolution 2016
UniqTag: content-derived unique and stable identifiers for gene annotation
SD Jackman, J Bohlmann, I Birol
PLOS ONE 2015

Selected Publications

Assembly of the complete Sitka spruce chloroplast…
L Coombe, RL Warren, SD Jackman, C Yang, BP Vandervalk, …, I Birol
PLOS ONE 2016
Spaced seed data structures for de novo assembly
I Birol, J Chu, H Mohamadi, SD Jackman, K Raghavan, …, RL Warren
International Journal of Genomics 2015
Konnector v2.0: pseudo-long reads from PE sequencing
BP Vandervalk, C Yang, Z Xue, K Raghavan, J Chu, H Mohamadi, SD Jackman, …, I Birol
BMC Medical Genomics 2015
Sealer: a scalable gap-closing application…
D Paulino, RL Warren, BP Vandervalk, A Raymond, SD Jackman, I Birol
BMC Bioinformatics 2015
On the representation of de Bruijn graphs
R Chikhi, A Limasset, SD Jackman, JT Simpson, P Medvedev
Journal of Computational Biology 2015
Improved white spruce (Picea glauca) genome…
RL Warren, CI Keeling, MMS Yuen, A Raymond, GA Taylor, …, J Bohlmann
The Plant Journal 2015
Assembling the 20Gb white spruce genome…
I Birol, A Raymond, SD Jackman, S Pleasance, R Coope, …, SJM Jones
Bioinformatics 2013

ABySS 1.0

	Human	Spruce
Genome size	3 Gbp	20 Gbp
RAM	418 GB	4.3 TB
CPU cores	64	1,380
Wall time	14 hours	12 days
Year	2009 & 2017	2013
Short DOI	doi.org/f9x8qp	doi.org/f4zzrr

Solid reads are extended using the Bloom filter de Bruijn graph to assemble unitigs

ABySS 2.0 reduces memory usage 10-fold vs ABySS 1.0 for human genome assembly (GIAB HG004 NA24143)

Spruce genome assemblies

ABySS	1.3.5	2.0.0
Spruce species	Interior	Sitka
Machines	115	1
RAM (GB)	4,300	500
CPU cores	1,380	64
CPU time* (years)	6.0	3.2
Wall time* (days)	1.6	18
Year	2013	2017
Short DOI	doi.org/f4zzrr	NA

* Time of unitig assembly without scaffolding

Contiguity and correctness are comparable

Tools for Linked Reads

Align linked reads
Lariat (Long Ranger) · EMA
Structural variants
Long Ranger · GROC-SVs · NAIBR · SVenX · Topsorter
Phase variants
Long Ranger
Genome sequence assembly
Supernova
Scaffolding
ARCS · Architect · Fragscaff · Scaff10x

https://github.com/johandahlberg/awesome-10x-genomics

Tigmint Method

Map reads to the assembly
Group reads within d bp of each other (d = 50 kbp)
Infer start and end coordinates of molecules
Construct an interval tree of the molecules
Each w bp region ought to be spanned by n molecules
(w = 1 kbp, n = 20)
Identify regions with fewer than n spanning molecules
Cut sequences at regions with insufficient coverage
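The method above can be sketched for a single assembled sequence. This is an illustration under assumptions, not Tigmint's code: reads are reduced to (barcode, position) pairs, a linear scan replaces the interval tree, windows are non-overlapping, and the function and parameter names are hypothetical. The defaults mirror the values of d, w, and n given above.

```python
from collections import defaultdict


def infer_molecules(read_positions, max_gap=50_000):
    """Group reads of the same barcode lying within max_gap bp into molecules.

    read_positions is a list of (barcode, position) pairs on one sequence.
    Returns a list of (start, end) molecule extents.
    """
    positions_by_barcode = defaultdict(list)
    for barcode, position in read_positions:
        positions_by_barcode[barcode].append(position)

    molecules = []
    for positions in positions_by_barcode.values():
        positions.sort()
        start = previous = positions[0]
        for position in positions[1:]:
            if position - previous > max_gap:
                # A gap larger than max_gap starts a new molecule.
                molecules.append((start, previous))
                start = position
            previous = position
        molecules.append((start, previous))
    return molecules


def low_coverage_windows(molecules, seq_length, window=1000, min_spanning=20):
    """Return windows spanned by fewer than min_spanning molecules (candidate cuts)."""
    cuts = []
    for start in range(0, seq_length - window + 1, window):
        end = start + window
        spanning = sum(1 for s, e in molecules if s <= start and e >= end)
        if spanning < min_spanning:
            cuts.append((start, end))
    return cuts
```

In this simplified form the ends of a sequence are also flagged, because few molecules can span a window at the very edge; a practical tool would need to treat sequence ends separately.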

Human genome assemblies (GIAB HG004 NA24143)

Note: Supernova used only linked reads; the others used paired-end, mate-pair, and linked reads (PE+MP+LR).

Tigmint Time and Memory

bwa mem: Map reads to assembly
5½ hours, 17 GB RAM, 48 threads
tigmint-molecule: Group reads into molecules
3¼ hours, 0.08 GB RAM, 1 thread
tigmint-cut: Identify misassemblies and cut sequences
7 minutes, 3.3 GB RAM, 48 threads

Western Redcedar Assembly

12.5 Gbp genome size estimated by flow cytometry
(Hizume et al. 2001 doi.org/d89svf)
9.8 Gbp genome size estimated by GenomeScope
8.0 Gbp assembled in scaffolds 1 kbp or larger

Western Redcedar BUSCO

60.4% of core single-copy genes present (BUSCO)

53.9% complete
6.5% fragmented
39.6% missing
