“You complete me.”
– Tom Cruise in Jerry Maguire (1996)
On this blog, I regularly write about studies using whole genome data, such as the hybrid genome of the Italian Sparrow or the search for migration genes in the Willow Warbler genome. Indeed, ornithology is entering the genomics era. But did you ever wonder about the ‘wholeness’ of these whole genomes?
In a recent paper in Molecular Ecology Resources, Valentina Peona, Matthias Weissensteiner and Alexander Suh explore the completeness of current avian genome assemblies. At the moment, there are more than 100 bird genomes available at GenBank, and the Bird 10,000 Genomes Project (B10K), which aims to sequence all extant bird species, has generated over 300 assemblies. That is a lot of data.
But how complete are these assemblies? To figure this out, the authors compared summary statistics of the genome assemblies with estimates of genome size based on flow cytometry. This analysis revealed that between 7 and 42 percent of a bird genome is missing from its assembly.
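To make that comparison concrete, here is a minimal sketch of the arithmetic, assuming both the assembly size and the flow cytometry estimate are already expressed in base pairs. The function name and the numbers are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
def missing_fraction(assembly_bp: int, estimated_genome_bp: int) -> float:
    """Share of the estimated genome that is absent from the assembly."""
    if estimated_genome_bp <= 0:
        raise ValueError("estimated genome size must be positive")
    # An assembly larger than the estimate is reported as 0 missing.
    return max(0.0, 1.0 - assembly_bp / estimated_genome_bp)


# Hypothetical values, for illustration only (not from the paper).
assembly_bp = 1_050_000_000          # total length of the assembly
estimated_genome_bp = 1_250_000_000  # genome size estimate from flow cytometry

print(f"{missing_fraction(assembly_bp, estimated_genome_bp):.1%} missing")
```

The result is only as good as the genome size estimate itself, so it is best read as a rough indication rather than an exact measurement.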
Mind the Gaps
Where is all this missing data? Unlike your car keys, the location of this missing data can easily be deduced. Most genome assemblies are based on short-read technologies, which sequence about 100 to 150 base pairs at a time. Next, these short reads need to be combined into contiguous sequences (‘contigs’) and linked contigs (‘scaffolds’). If the whole genome is the distance between Rome and Paris, then a 150 bp read would be the size of a smartphone (see figure below). That is a lot of smartphones you have to line up…
Scaffolds consist of contigs and assembly gaps (placeholders of undetermined ‘N’ nucleotides). This is where you can find the missing data. These assembly gaps mostly contain repetitive elements such as interspersed repeats (transposable elements and endogenous viruses) and tandem repeats (microsatellites and satellites). These sequences are difficult to assemble because of their repetitive nature: “Like a puzzle piece occurring multiple times in a single puzzle game.”
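If you want to see these gaps for yourself, the sketch below counts the runs of ‘N’ in a scaffold sequence. It is a toy example: the function name and the sequence are made up, and in real assemblies the length of a gap can be an arbitrary placeholder, so this counts gaps rather than measuring how much sequence is truly missing.

```python
import re


def gap_summary(scaffold: str) -> dict:
    """Count runs of 'N' (assembly gaps) and the bases they cover."""
    # Each uninterrupted run of N is treated as one assembly gap.
    gaps = [m.end() - m.start() for m in re.finditer(r"N+", scaffold.upper())]
    gap_bp = sum(gaps)
    return {
        "n_gaps": len(gaps),
        "gap_bp": gap_bp,
        "gap_fraction": gap_bp / len(scaffold) if scaffold else 0.0,
    }


# Made-up scaffold: three contigs joined by two gaps.
toy_scaffold = "ACGTACGT" + "N" * 10 + "GGCCTTAA" + "N" * 4 + "ACGT"
print(gap_summary(toy_scaffold))
```

For a real genome you would loop over every scaffold in the FASTA file and add up the results.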
Long-Read Technologies
So, how can we deal with these gaps? New techniques that capture long stretches of DNA (up to millions of base pairs) might provide the solution. Think of 10X Genomics Linked-Read sequencing, BioNano optical mapping and chromosome conformation capture (Hi-C). This is nicely illustrated by the recent developments in sequencing the Chicken genome: comparing the short-read genome (galGal4) with a more recent long-read genome revealed that the amount of missing data decreased from 14.1 to 2.4 percent. We are not there yet, but we are getting there.
Peona, V., Weissensteiner, M.H. & Suh, A. (2018) How complete are ‘complete’ genome assemblies? – An avian perspective. Molecular Ecology Resources. doi: 10.1111/1755-0998.12933