“You complete me.”
– Tom Cruise in Jerry Maguire (1996)
On this blog, I regularly write about studies using whole genome data, such as the hybrid genome of the Italian Sparrow or the search for migration genes in the Willow Warbler genome. Indeed, ornithology is entering the genomics era. But have you ever wondered about the ‘wholeness’ of these whole genomes?
Genome Assemblies
In a recent paper in Molecular Ecology Resources, Valentina Peona, Matthias Weissensteiner and Alexander Suh explore the completeness of current avian genome assemblies. At the moment, there are more than 100 bird genomes available at GenBank and the Bird 10,000 Genomes Project (B10K) – which aims to sequence all extant bird species – has generated over 300 assemblies. That is a lot of data.
But how complete are these assemblies? To figure this out, the authors compared summary statistics of the genome assemblies (such as total assembly size) with estimates of genome size based on flow cytometry. This analysis revealed that between 7 and 42 percent of a bird genome is missing from its assembly.
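To make the comparison concrete, here is a minimal Python sketch of the arithmetic. It is not the authors' actual pipeline: the function names and the example numbers are my own, and the conversion of 1 picogram of DNA to roughly 978 million base pairs is a commonly used approximation that I am assuming here, since flow cytometry typically reports genome size as a mass.

```python
# Assumption: widely used approximation, 1 pg of DNA ~ 0.978 billion base pairs.
PG_TO_BP = 0.978e9


def c_value_to_bp(c_value_pg: float) -> float:
    """Convert a flow cytometry genome size estimate (in pg) to base pairs."""
    return c_value_pg * PG_TO_BP


def missing_fraction(assembly_size_bp: float, expected_size_bp: float) -> float:
    """Fraction of the expected genome that is absent from the assembly."""
    if expected_size_bp <= 0:
        raise ValueError("expected genome size must be positive")
    # An assembly can be larger than the estimate; clamp at zero in that case.
    return max(0.0, 1.0 - assembly_size_bp / expected_size_bp)


# Hypothetical bird: a 1.05 Gb assembly versus a 1.25 Gb expected genome size.
print(f"{missing_fraction(1.05e9, 1.25e9):.1%} missing")
```

With these made-up numbers, 16 percent of the genome would be unaccounted for, which falls within the 7 to 42 percent range reported in the paper.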

The B10K Project aims to sequence the genomes of all living bird species (from: https://b10k.genomics.cn/)
Mind the Gaps
Where is all this missing data? Unlike your car keys, the location of this missing data can easily be deduced. Most genome assemblies are based on short read technologies, which sequence roughly 100 to 150 base pairs (bp) at a time. Next, these short reads need to be combined into contiguous sequences (‘contigs’) and linked contigs (‘scaffolds’). If the whole genome is the distance between Rome and Paris, then a 150 bp read would be the size of a smartphone (see figure below). That is a lot of smartphones you have to line up…

A nice representation of the differences in scale of several sequencing techniques (from Peona et al. 2018 Molecular Ecology Resources).
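To get a feeling for how many ‘smartphones’ that is, here is a small back-of-the-envelope sketch in Python. The genome size of 1.2 billion base pairs is a round number I am assuming for illustration (it is not taken from the paper), and the 30-fold coverage is likewise just an example value.

```python
import math


def reads_needed(genome_size_bp: float, read_length_bp: int, coverage: float = 1.0) -> int:
    """Number of reads required to cover the genome `coverage` times on average."""
    return math.ceil(genome_size_bp * coverage / read_length_bp)


# Assumed round numbers: a 1.2 Gb genome and 150 bp reads.
GENOME_SIZE_BP = 1.2e9
READ_LENGTH_BP = 150

print(reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP))               # end to end, no overlap
print(reads_needed(GENOME_SIZE_BP, READ_LENGTH_BP, coverage=30))  # 30-fold coverage
```

Under these assumptions, you need millions of reads just to span the genome once, and far more in practice, because reads must overlap before they can be stitched together.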
Scaffolds consist of contigs and assembly gaps (placeholders of undetermined ‘N’ nucleotides). This is where you can find the missing data. These assembly gaps mostly contain repetitive elements such as interspersed repeats (transposable elements and endogenous viruses) and tandem repeats (microsatellites and satellites). These sequences are difficult to assemble because of their repetitive nature: “Like a puzzle piece occurring multiple times in a single puzzle game.”
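The relationship between scaffolds, contigs and gaps can be sketched in a few lines of Python. This is a toy example with a made-up sequence and a hypothetical function name, not a tool used in the paper: it simply splits a scaffold at runs of ‘N’ and counts what is left.

```python
import re

GAP_PATTERN = re.compile(r"[Nn]+")  # a gap is one or more consecutive N's


def gap_summary(scaffold: str) -> dict:
    """Count contigs and assembly gaps in a single scaffold sequence."""
    gap_lengths = [m.end() - m.start() for m in GAP_PATTERN.finditer(scaffold)]
    contigs = [c for c in GAP_PATTERN.split(scaffold) if c]
    gap_bases = sum(gap_lengths)
    return {
        "n_contigs": len(contigs),
        "n_gaps": len(gap_lengths),
        "gap_bases": gap_bases,
        "gap_fraction": gap_bases / len(scaffold) if scaffold else 0.0,
    }


# Toy scaffold: three contigs separated by two gaps.
toy_scaffold = "ACGTACGT" + "N" * 5 + "GGCCTTAA" + "N" * 3 + "ACGT"
print(gap_summary(toy_scaffold))
```

One caveat: in real assemblies, the length of a run of N's is often only a rough estimate or an arbitrary placeholder, so the number of N's in a scaffold does not necessarily equal the amount of sequence that is missing.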
Long-Read Technologies
So, how can we deal with these gaps? New techniques that read or link long stretches of DNA (up to millions of base pairs) might provide the solution. Think of 10X Genomics Linked-Read sequencing, BioNano optical mapping and chromosome conformation capture (Hi-C). This is nicely illustrated by the recent developments in sequencing the Chicken genome: comparing the short-read genome (galGal4) with a more recent long-read genome revealed that the amount of missing data decreased from 14.1 to 2.4 percent. We are not there yet, but we are getting there.

Two chicks exploring a map of their genome (from: http://www.wikipedia.com/)
References
Peona, V., Weissensteiner, M.H. & Suh, A. (2018) How complete are ‘complete’ genome assemblies? – An avian perspective. Molecular Ecology Resources. doi: 10.1111/1755-0998.12933