Efficient Assembly of Large Genomes

Shaun Jackman

Genome Sciences Centre, BC Cancer, Vancouver, Canada
@sjackman · github.com/sjackman · sjackman.ca


Introduction
ABySS 2.0
Tigmint
UniqTag
ORCA
Organellar genomes of white spruce
Mitochondrial genome of Sitka spruce
Genome assembly of western redcedar
Conclusion

Sitka Spruce Mitochondrion
Submitted
2019 doi.org/c4mv

ORCA
Bioinformatics
2019 doi.org/c4mw

Tigmint
BMC Bioinformatics
2018 doi.org/cwfh

ABySS 2.0
Genome Research
2017 doi.org/f9x8qp

White Spruce Organelles
Genome Biology and Evolution
2016 doi.org/f8bxck

UniqTag
PLOS ONE
2015 doi.org/c3m3

Short Read Genome Assembly

ABySS 1.0 (2009) was the first to assemble
a human genome from short reads (42 bp!)

de Bruijn graph assembler
Stored k-mers in a hash table
Distributed the hash table over many machines
Used MPI to aggregate sufficient memory
Assembles large genomes
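
The hash-table approach above can be sketched in a few lines. This is a toy illustration, not ABySS code: it ignores reverse complements, the function names are invented for the example, and `owner` is a hypothetical stand-in for however k-mers are assigned to machines.

```python
import zlib

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_kmer_table(reads, k):
    """Store each k-mer in a hash table (a dict), counting occurrences."""
    table = {}
    for read in reads:
        for kmer in kmers(read, k):
            table[kmer] = table.get(kmer, 0) + 1
    return table

def successors(table, kmer):
    """Edges are implicit: try each of the four possible next bases."""
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in table]

def owner(kmer, num_machines):
    """Hypothetical partitioning: hash the k-mer to pick a machine."""
    return zlib.crc32(kmer.encode("ascii")) % num_machines

reads = ["ACGTAC", "CGTACG"]
table = build_kmer_table(reads, k=3)
print(successors(table, "ACG"))  # ['CGT']
```

Every k-mer is stored explicitly, so memory grows with both the number of distinct k-mers and with k, and a neighbour lookup may land on another machine.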

Challenges

Uses lots of memory
Network communication is super slow
Message passing is also slow

Solution

A memory-efficient data structure
reduces memory usage
Fitting the entire graph on a single machine
eliminates network communication
Using shared memory (OpenMP)
eliminates message passing (MPI)

ABySS 2.0 reduces the memory
usage of ABySS tenfold.

Memory-efficient de Bruijn graph using a Bloom filter
Memory usage is independent of k
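
A minimal Bloom filter sketch shows why memory does not depend on k: the bit array has a fixed size, and each k-mer only sets a few bits in it. The class below is an illustration under assumptions (BLAKE2b as the hash function, the parameter names), not the data structure used in ABySS 2.0.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; items are hashed to num_hashes bit positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive independent hash values by salting BLAKE2b with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1024, num_hashes=3)
bf.add("ACGTA")                  # a 5-mer
bf.add("ACGTACGTACGTACGTACGTA")  # a 21-mer costs the same number of bits
print("ACGTA" in bf)             # True
print(len(bf.bits))              # 128 bytes, whatever the value of k
```

The k-mer sequence itself is never stored, only the bits it hashes to.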

Navigating a Bloom filter de Bruijn graph
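
Because a Bloom filter stores no edges, navigation means asking whether each of the four possible neighbours is present. A sketch of that idea follows, using a plain Python `set` as a stand-in for the filter (any object supporting `in` would work). The function names and the stopping rule are assumptions for illustration, and reverse complements are ignored.

```python
def successors(kmer, kmers):
    """Query the four possible right neighbours of kmer."""
    suffix = kmer[1:]
    return [suffix + b for b in "ACGT" if suffix + b in kmers]

def predecessors(kmer, kmers):
    """Query the four possible left neighbours of kmer."""
    prefix = kmer[:-1]
    return [b + prefix for b in "ACGT" if b + prefix in kmers]

def extend_right(seed, kmers):
    """Walk right from seed while the path is unambiguous."""
    path = seed
    current = seed
    visited = {seed}
    while True:
        nxt = successors(current, kmers)
        if len(nxt) != 1:
            break  # dead end or branch
        candidate = nxt[0]
        if candidate in visited or len(predecessors(candidate, kmers)) != 1:
            break  # cycle, or another path merges in
        current = candidate
        visited.add(current)
        path += current[-1]
    return path

kmers = {"ACG", "CGT", "GTA", "GTC"}
print(extend_right("ACG", kmers))  # ACGT (stops at the GTA/GTC branch)
```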

Sequencing errors and Bloom filter false positives
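
The false positive rate of a Bloom filter can be estimated with the standard approximation (1 - e^(-hn/m))^h for m bits, h hash functions, and n inserted items. The helper below is a sketch of that textbook formula; the function name and the example sizes are made up and are not taken from ABySS 2.0.

```python
import math

def false_positive_rate(num_bits, num_hashes, num_items):
    """Standard approximation of the Bloom filter false positive rate."""
    return (1.0 - math.exp(-num_hashes * num_items / num_bits)) ** num_hashes

# Hypothetical sizes: the same k-mer set in a smaller and a larger filter.
small = false_positive_rate(num_bits=8_000, num_hashes=1, num_items=1_000)
large = false_positive_rate(num_bits=32_000, num_hashes=1, num_items=1_000)
print(small > large)  # True: more bits per k-mer means fewer false k-mers
```

A false positive appears in the graph as a k-mer that was never in the reads, much like a k-mer created by a sequencing error.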

Spruce genome assemblies

ABySS	1.3.5	2.0.0
Spruce species	Interior	Sitka
Machines	115	1
RAM (GB)	4,300	500
CPU cores	1,380	64
CPU time*	6.0 years	3.2 years

* Time of unitig assembly without scaffolding

Human: 42 Mbp NG50 with BioNano optical mapping
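
For reference, N50 and NG50 can be computed as below. This is a sketch of the usual definitions with invented function names and toy lengths: N50 is the length at which the longest sequences together cover half of the assembly, and NG50 uses half of the genome size instead.

```python
def nx50(lengths, target_size):
    """Length at which the longest sequences reach half of target_size."""
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if 2 * total >= target_size:
            return length
    return 0  # the assembly covers less than half of target_size

def n50(lengths):
    return nx50(lengths, sum(lengths))

def ng50(lengths, genome_size):
    return nx50(lengths, genome_size)

lengths = [80, 70, 50, 40, 30, 20, 10]  # toy scaffold lengths, sum = 300
print(n50(lengths))                     # 70
print(ng50(lengths, genome_size=400))   # 50
```

NGA50 is computed the same way as NG50, but on aligned blocks after breaking scaffolds at misassemblies.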

ABySS 2.0 Conclusions

ABySS 2.0 reduces memory usage about tenfold
from 418 GB to 34 GB for human
from 4,300 GB to 500 GB for spruce
High-throughput short-read sequencing
combined with large molecule scaffolding
such as linked reads and optical mapping
permits cost-effective assembly of large genomes
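
As a quick check of the memory figures above, the fold reductions work out as follows. The helper name is made up for this example. Note that the spruce figures compare two different species (interior spruce with ABySS 1.3.5, Sitka spruce with ABySS 2.0.0).

```python
def fold_reduction(before_gb, after_gb):
    """Ratio of memory used before to memory used after."""
    return before_gb / after_gb

human = fold_reduction(418, 34)
spruce = fold_reduction(4300, 500)
print(f"human: {human:.1f}x, spruce: {spruce:.1f}x")
# human: 12.3x, spruce: 8.6x
```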

Linked Reads

Contigs and scaffolds
come to an end due to…

repeats
sequencing gaps
structural variation
misassemblies

Correct misassemblies
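
One way to find a misassembly with linked reads is to look for positions that few or no large molecules span. The sketch below illustrates that idea only. It is not Tigmint's implementation, and the molecule extents, window size, and threshold are hypothetical values.

```python
def spanning_count(molecules, start, end):
    """Count molecules whose extent covers the whole window [start, end)."""
    return sum(1 for (s, e) in molecules if s <= start and e >= end)

def poorly_spanned_windows(molecules, contig_length, window, min_spanning):
    """Return windows spanned by fewer than min_spanning molecules."""
    flagged = []
    for start in range(0, contig_length - window + 1, window):
        end = start + window
        if spanning_count(molecules, start, end) < min_spanning:
            flagged.append((start, end))
    return flagged

# Hypothetical molecule extents (start, end) inferred from linked reads.
molecules = [(0, 500), (100, 600), (700, 1200), (750, 1300)]
cuts = poorly_spanned_windows(molecules, contig_length=1300,
                              window=100, min_spanning=1)
print(cuts)  # [(600, 700)]: no molecule spans this window, so cut here
```

Cutting the sequence at the flagged window yields two pieces that scaffolding can then rejoin correctly.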

Scaffold

https://github.com/JustinChu/JupiterPlot

Human genome assembly (GIAB HG004 NA24143)

Assembly Tools	NGA50
ABySS 2.0	3 Mbp
ABySS 2.0 + ARCS	8 Mbp
ABySS 2.0 + Tigmint + ARCS	16 Mbp

Tigmint reduced misassemblies by 216 (27% reduction)

Corrects and improves long read assemblies too!

Sequencing	Nanopore	PacBio
Assembler	Canu	Falcon
NGA50 before	5.4 Mbp	4.2 Mbp
NGA50 after	10.9 Mbp	12.0 Mbp
Improvement	2.0 fold	2.9 fold

Tigmint Conclusions

Scaffolding after correcting with Tigmint yields an assembly both more correct and more contiguous

Linked reads permit cost-effective assembly of large genomes using high-throughput sequencing

Western redcedar (Thuja plicata)

Western redcedar (Thuja plicata) Range

Western Redcedar Methods

Conifer Assemblies

Year	Species	Scaffold N50
2018	Western redcedar	2,310 kbp
2017	Sugar pine²	2,510 kbp
2017	Douglas fir	341 kbp
2017	Loblolly pine²	108 kbp
2016	Sugar pine¹	247 kbp
2015	Interior white spruce²	83 kbp
2015	White spruce	20 kbp
2014	Loblolly pine¹	67 kbp
2013	Interior white spruce¹	20 kbp
2013	Norway spruce	5 kbp

¹initial assembly ²improved assembly