Shaun Jackman
@sjackman | github.com/sjackman | sjackman.ca
BC Cancer Agency Genome Sciences Centre
Vancouver, Canada
With many thanks to Rayan Chikhi for sharing his slides
Comparative analyses, like gene synteny
A complete genome
High-coverage PacBio data
Gene content and SNPs
Draft quality assembly
A couple of Illumina libraries
Conclusion of the GAGE benchmark:
In terms of assembly quality, there is no single best assembler.
One definition of an assembly
(a trickier question than it seems)
A set of sequences that approximates the original sequenced material.
A set of sequences that explains the sequencing reads.
Any sequence that comes out of the sequencer
Two reads from a fragment less than 1000 bp
Two reads from a fragment larger than 1000 bp
A read longer than 1000 bp
A k bp subsequence of a read
Contigs before collapsing heterozygous variants
An assembled sequence with no gaps
An assembled sequence with gaps (“N”s)
Contigs extracted from scaffolds
A graph is composed of
Each edge connects two vertices
To make good choices, an assembler needs to find all overlapping reads.
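A minimal sketch of what "finding overlapping reads" means, assuming exact suffix-prefix matches and a hypothetical helper named overlap (neither is from the slides). It compares every pair of reads, which is only practical for toy inputs; real assemblers use indexes to avoid the all-pairs comparison.

```python
def overlap(a, b, min_length=1):
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 when no overlap of at least min_length exists.
    Exact matching only; sequencing errors are ignored in this sketch.
    """
    for n in range(min(len(a), len(b)), min_length - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


reads = ["ACTG", "CTGC", "TGCC"]

# All ordered pairs of distinct reads (quadratic; fine for a toy example).
overlaps = {
    (a, b): overlap(a, b, min_length=2)
    for a in reads
    for b in reads
    if a != b
}

for (a, b), n in sorted(overlaps.items()):
    if n > 0:
        print(a, "overlaps", b, "by", n)
```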
Two types of assembly graphs
Someone who understands assembly graphs can intuit
A de Bruijn graph is a special case of an overlap graph.
A single read and k=3
ACTG
ACT -> CTG
Many reads and k=3
ACTG
CTGC
TGCC
ACT -> CTG -> TGC -> GCC
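The graph above can be reproduced with a short sketch. It assumes that nodes are k-mers and that an edge joins two k-mers that are adjacent in some read; the function names kmers and de_bruijn_edges are illustrative, not from the slides.

```python
def kmers(read, k):
    """Return the k-mers of a read in order, including duplicates."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_edges(reads, k):
    """Return the set of edges between consecutive k-mers of each read."""
    edges = set()
    for read in reads:
        ks = kmers(read, k)
        edges.update(zip(ks, ks[1:]))
    return edges


edges = de_bruijn_edges(["ACTG", "CTGC", "TGCC"], 3)
for u, v in sorted(edges):
    print(u, "->", v)
```

Because the edges are stored in a set, an edge seen in several reads (such as CTG -> TGC) is kept only once.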
What happens if we add duplicate reads?
ACTG
ACTG
CTGC
CTGC
CTGC
TGCC
TGCC
ACT -> CTG -> TGC -> GCC
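Duplicate reads leave the set of nodes unchanged; what changes is how often each k-mer is seen. A small sketch, assuming we track that count (often called abundance) with a Counter:

```python
from collections import Counter


def kmers(read, k):
    """Return the k-mers of a read in order, including duplicates."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


reads = ["ACTG", "ACTG", "CTGC", "CTGC", "CTGC", "TGCC", "TGCC"]

# Number of times each k-mer occurs across all reads.
abundance = Counter(kmer for read in reads for kmer in kmers(read, 3))

# The node set is the same as without the duplicate reads.
nodes = set(abundance)

for kmer, count in sorted(abundance.items()):
    print(kmer, count)
```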
How does a sequencing error at the end of a read impact the de Bruijn graph?
ACTG
CTGC
CTGA
TGCC
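One way to explore this question is to build the graph and look for branching nodes. The sketch below assumes the final A of CTGA is the erroneous base (the slide does not say which read carries the error) and uses an illustrative helper named successors.

```python
from collections import defaultdict


def successors(reads, k):
    """Map each k-mer to the set of k-mers that follow it in some read."""
    succ = defaultdict(set)
    for read in reads:
        ks = [read[i:i + k] for i in range(len(read) - k + 1)]
        for u, v in zip(ks, ks[1:]):
            succ[u].add(v)
    return succ


succ = successors(["ACTG", "CTGC", "CTGA", "TGCC"], 3)

# Nodes with more than one successor are branch points.
branching = {u: sorted(vs) for u, vs in succ.items() if len(vs) > 1}

# Nodes that never appear as a source have no outgoing edge.
all_nodes = set(succ) | {v for vs in succ.values() for v in vs}
dead_ends = sorted(all_nodes - set(succ))

print(branching)
print(dead_ends)
```

CTG branches to both TGC and TGA. Under the stated assumption, TGA is a short dead-end branch, a structure commonly called a tip; GCC is a dead end only because the reads stop there.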
What is the effect of a single-nucleotide variant (SNV) on the graph? (or a sequencing error in the middle of a read)
AGCATGA
AGCCTGA
AGC -> GCA -> CAT -> ATG -> TGA
AGC -> GCC -> CCT -> CTG -> TGA
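The two paths above share their first and last k-mers, a structure commonly called a bubble. A hedged sketch that locates where the paths diverge and reconverge, using illustrative successor and predecessor maps:

```python
from collections import defaultdict

reads = ["AGCATGA", "AGCCTGA"]
k = 3

succ = defaultdict(set)  # k-mer -> k-mers that follow it
pred = defaultdict(set)  # k-mer -> k-mers that precede it
for read in reads:
    ks = [read[i:i + k] for i in range(len(read) - k + 1)]
    for u, v in zip(ks, ks[1:]):
        succ[u].add(v)
        pred[v].add(u)

# The bubble opens at a node with two successors
# and closes at a node with two predecessors.
diverge = sorted(u for u, vs in succ.items() if len(vs) > 1)
converge = sorted(v for v, us in pred.items() if len(us) > 1)

print("diverges at", diverge, "and reconverges at", converge)
```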
What is the effect of a small repeat on the graph?
AAACTGTCTGATTT
AAA -> AAC -> ACT -> CTG -> TGT -> GTC -> TCT -> CTG -> TGA -> GAT -> ATT -> TTT
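In the path above, CTG is visited twice, so the graph has a single CTG node with two ways in and two ways out. A minimal sketch that makes this visible (variable names are illustrative):

```python
from collections import Counter, defaultdict

read = "AAACTGTCTGATTT"
k = 3

ks = [read[i:i + k] for i in range(len(read) - k + 1)]
counts = Counter(ks)

# k-mers that occur more than once in the read.
repeated = sorted(kmer for kmer, n in counts.items() if n > 1)

succ = defaultdict(set)
pred = defaultdict(set)
for u, v in zip(ks, ks[1:]):
    succ[u].add(v)
    pred[v].add(u)

print(len(ks), "k-mers,", len(counts), "distinct")
print("repeated:", repeated)
print("into CTG:", sorted(pred["CTG"]), "out of CTG:", sorted(succ["CTG"]))
```

The path CTG -> TGT -> GTC -> TCT -> CTG forms a cycle, so the graph alone no longer says how many times the repeat should be traversed.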
Typically
OLC: Overlap, Layout, Consensus
Comparing assemblies is not simple.
There’s a trade-off between
Tools: QUAST and BUSCO (CEGMA)
TACAGT
CAGTC
AGTCA
CAGA
How many k-mers are in these reads for k=3 (including duplicates)?
How many distinct k-mers are in these reads for k=3? k=5?
Given these reads come from the genome TACAGTCAGA, what is the largest k such that the set of k-mers in the genome is identical to the set of k-mers in the reads above?
TACAGT
CAGTC
AGTCA
CAGA
How many k-mers are in these reads for k=3 (including duplicates)? 12
How many distinct k-mers are in these reads?
7 for k=3, and 4 for k=5
Given these reads come from the genome TACAGTCAGA, what is the largest k such that the set of k-mers in the genome is identical to the set of k-mers in the reads above?
k=3. For k=4, TCAG does not appear in the reads
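These answers can be checked with a few lines of code. This is only a sketch; kmer_list and read_kmer_set are illustrative helper names.

```python
def kmer_list(seq, k):
    """Return the k-mers of seq in order (empty if seq is shorter than k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def read_kmer_set(reads, k):
    """Return the set of distinct k-mers found in any read."""
    return {kmer for read in reads for kmer in kmer_list(read, k)}


reads = ["TACAGT", "CAGTC", "AGTCA", "CAGA"]
genome = "TACAGTCAGA"

total_k3 = sum(len(kmer_list(read, 3)) for read in reads)
distinct = {k: len(read_kmer_set(reads, k)) for k in (3, 5)}

# Values of k for which the genome and the reads have the same k-mer set.
matching = [
    k for k in range(1, len(genome) + 1)
    if set(kmer_list(genome, k)) == read_kmer_set(reads, k)
]
largest_k = max(matching)

# Genome 4-mers that no read contains.
missing_k4 = set(kmer_list(genome, 4)) - read_kmer_set(reads, 4)

print(total_k3, distinct, largest_k, missing_k4)
```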
The number of k-mers per read of length l is
l − k + 1
(including duplicate k-mers)
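Applied to the four exercise reads (lengths 6, 5, 5 and 4), the formula l − k + 1 gives the totals without enumerating any k-mers. In this sketch the clamp to zero for reads shorter than k is an added assumption, not something stated on the slide.

```python
def kmer_count(l, k):
    """Number of k-mers in a read of length l, including duplicates.

    Reads shorter than k contribute no k-mers (assumption of this sketch).
    """
    return max(0, l - k + 1)


read_lengths = [6, 5, 5, 4]

total_k3 = sum(kmer_count(l, 3) for l in read_lengths)  # 4 + 3 + 3 + 2
total_k5 = sum(kmer_count(l, 5) for l in read_lengths)  # 2 + 1 + 1 + 0

print(total_k3, total_k5)
```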