Sitka Spruce Mitochondrion bioRxiv
2019 doi.org/c4mv White Spruce Organelles Genome Biology and Evolution
2016 doi.org/f8bxck UniqTag PLOS ONE
2015 doi.org/c3m3
Publications
Five first-author (or joint) papers
One paper each year from 2015 through 2019
Collaborated on 32 papers since 2009
29 papers with at least 10 citations
ABySS has been cited over 2,900 times!
Short Read Genome Assembly
ABySS 1.0 (2009) was the first to assemble
a human genome from short reads (42 bp!)
de Bruijn graph assembler
Stored k-mers in a hash table
Distributed the hash table over many machines
Used MPI to aggregate sufficient memory
Assembles large genomes
Challenges
Uses lots of memory
Network communication is super slow
Message passing is also slow
Solution
A memory-efficient data structure
reduces memory usage
Fitting entire graph in a single machine
eliminates network communication
Using shared memory (OpenMP)
eliminates message passing (MPI)
ABySS 2.0 reduces the memory
usage of ABySS by ten fold.
Spruce genome assemblies
ABySS
1.3.5
2.0.0
Spruce species
Interior
Sitka
Machines
115
1
RAM (GB)
4,300
500
CPU cores
1,380
64
CPU time*
6.0 years
3.2 years
* Time of unitig assembly without scaffolding
ABySS 2.0 Conclusions
ABySS 2.0 reduces memory usage by 10 fold
from 418 GB to 34 GB for human
from 4,300 GB to 500 GB for spruce
High-throughput short-read sequencing
combined with large molecule scaffolding
such as linked reads and optical mapping
permits cost effective assembly of large genomes
Mitochondrial genome of Sitka spruce (doi.org/c4mv)
Genome assembly of western redcedar
Conclusion
Think of each molecule of linked reads as a long read.
Can we assemble these molecules using
an overlap-layout-consensus approach
without first assembling the reads?
Overlap Layout Consensus
Each barcode of linked reads is a bag of k-mers
Keep only the minimizers of each read for efficiency
Reduce a hundred k-mers per read to five minimizers
Discard most frequent minimizers, likely repetitive
Count shared minimizers per pair of barcodes
Barcode Overlap Graph
Each barcode is a vertex
Each edge is the overlap between two barcodes
Edge weight is number of shared minimizers
Separate Molecules
We have the barcode overlap graph
but we want the molecule overlap graph
Separate each barcode into its component molecules
Look at the neighborhood graph of each barcode
(vertex-induced subgraph of its immediate neighbors)
Each community is one molecule
Overlap Layout Consensus
A layout is a linear ordering of molecules
Find a path through the molecule overlap graph
Solve the traveling salesman problem
Optimal solution is NP-hard
Approximate solution is good enough
Start with a maximum spanning tree (MST)
Maximum Spanning Tree (MST)
Compute the maximum spanning tree
Prune short branches of the MST
Assemble contigs from simple non-branching paths
Inspired by MSTmap used for genetic linkage maps
MSTmap: Efficient and Accurate Construction of Genetic Linkage Maps from the Minimum Spanning Tree of a Graph
Wu et. al (2018) doi.org/d4sqs8
12.7 Mbp NG50, 25 chromosomes in 144 contigs
4.8 Mbp NG50 for Supernova
40.9 Mbp NG50, 23 chromosomes in 95 contigs
38.5 Mbp NG50 for Supernova
Scaling Up to Larger Genomes
Western redcedar (12 Gbp)
Sitka spruce (20 Gbp)
Overlap Layout Consensus
Scaffold by mapping contigs to the physical map
Targeted assembly of a chromosome, or a smaller region
Assemble the complete genome using multiple targeted assemblies
Photo by Martin Krzywinski
fin
Supplemental Slides
Physlr Run Time
Physlr Memory Usage
First-author Publications
Largest Complete Mitochondrial Genome of a Gymnosperm, Sitka Spruce (Picea sitchensis), Indicates Complex Physical Structure SD Jackman, L Coombe, RL Warren, H Kirk, E Trinh, T McLeod, S Pleasance, P Pandoh, Y Zhao, RJ Coope, J Bousquet, J Bohlmann, SJM Jones, I Birol bioRxiv 2019
ORCA: A Comprehensive Bioinformatics Container Environment for Education and Research SD Jackman, T Mozgacheva, S Chen, B O’Huiginn, L Bailey, I Birol, SJM Jones Bioinformatics 2019
Tigmint: correcting assembly errors using linked reads from large molecules SD Jackman, L Coombe, J Chu, RL Warren, BP Vandervalk, … BMC Bioinformatics 2018
ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter SD Jackman*, BP Vandervalk*, H Mohamadi, J Chu, S Yeo, SA Hammond, … Genome Research 2017
Organellar genomes of white spruce (Picea glauca): assembly and annotation SD Jackman, RL Warren, EA Gibb, BP Vandervalk, H Mohamadi, J Chu, … Genome Biology and Evolution 2015
UniqTag: content-derived unique and stable identifiers for gene annotation SD Jackman, J Bohlmann, I Birol PLOS ONE 2015
Selected Publications
Assembly of the complete Sitka spruce chloroplast… L Coombe, RL Warren, SD Jackman, C Yang, BP Vandervalk, …, I Birol PloS one 2016
Spaced seed data structures for de novo assembly
I Birol, J Chu, H Mohamadi, SD Jackman, K Raghavan, …, RL Warren International journal of genomics 2015
Konnector v2.0: pseudo-long reads from PE sequencing
BP Vandervalk, C Yang, Z Xue, K Raghavan, J Chu, H Mohamadi, SD Jackman, …, I Birol BMC medical genomics 2015
Sealer: a scalable gap-closing application…
D Paulino, RL Warren, BP Vandervalk, A Raymond, SD Jackman, I Birol BMC Bioinformatics 2015
On the representation of de Bruijn graphs
R Chikhi, A Limasset, SD Jackman, JT Simpson, P Medvedev Journal of Computational Biology 2015
Improved white spruce (Picea glauca) genome…
RL Warren, CI Keeling, MMS Yuen, A Raymond, GA Taylor, …, J Bohlmann The Plant Journal 2015
Assembling the 20Gb white spruce genome…
I Birol, A Raymond, SD Jackman, S Pleasance, R Coope, …, SJM Jones Bioinformatics 2013