Mind the gaps: The incompleteness of complete avian genome assemblies

“You complete me.”

– Tom Cruise in Jerry Maguire (1996)

On this blog, I regularly write about studies using whole genome data, such as the hybrid genome of the Italian Sparrow or the search for migration genes in the Willow Warbler genome. Indeed, ornithology is entering the genomics era. But did you ever wonder about the ‘wholeness’ of these whole genomes?


Genome Assemblies

In a recent paper in Molecular Ecology Resources, Valentina Peona, Matthias Weissensteiner and Alexander Suh explore the completeness of current avian genome assemblies. At the moment, there are more than 100 bird genomes available at GenBank, and the Bird 10,000 Genomes Project (B10K), which aims to sequence all extant bird species, has generated over 300 assemblies. That is a lot of data.

But how complete are these assemblies? To figure this out, the authors compared summary statistics of the genome assemblies with estimates of genome size based on flow cytometry. This analysis revealed that between 7 and 42 percent of a bird genome is missing from its assembly.
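To make that comparison concrete, here is a minimal Python sketch of the arithmetic: the missing fraction is simply one minus the ratio of assembly size to estimated genome size. The function name and the example numbers are hypothetical and chosen only for illustration; they are not taken from the paper, and the authors' actual analysis may differ in its details.

```python
def missing_fraction(assembly_size_bp: float, estimated_genome_size_bp: float) -> float:
    """Return the fraction of the estimated genome that is absent from the assembly.

    assembly_size_bp: total length of the assembly in base pairs.
    estimated_genome_size_bp: independent genome size estimate in base pairs
        (for example, one derived from flow cytometry).
    """
    if estimated_genome_size_bp <= 0:
        raise ValueError("estimated genome size must be positive")
    return 1.0 - assembly_size_bp / estimated_genome_size_bp


# Hypothetical example values, not real measurements.
assembly_size = 1.05e9        # assembled base pairs
cytometry_estimate = 1.25e9   # genome size estimate in base pairs

percent_missing = 100 * missing_fraction(assembly_size, cytometry_estimate)
print(f"{percent_missing:.1f}% of the genome is missing from the assembly")
```

A negative result would mean that the assembly is larger than the size estimate, which is worth flagging rather than hiding, so the sketch does not clamp the value at zero.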


The B10K Project aims to sequence the genomes of all living bird species (from: https://b10k.genomics.cn/)


Mind the Gaps

Where is all this missing data? Unlike your car keys, the location of this missing data can easily be deduced. Most genome assemblies are based on short-read technologies, which sequence only about 100 to 150 base pairs at a time. Next, these short reads need to be combined into contiguous sequences (‘contigs’) and linked contigs (‘scaffolds’). If the whole genome is the distance between Rome and Paris, then a 150 bp read would be the size of a smartphone (see figure below). That is a lot of smartphones you have to line up…


A nice representation of the differences in scale of several sequencing techniques (from Peona et al. 2018 Molecular Ecology Resources).

Scaffolds consist of contigs and assembly gaps (placeholders of undetermined ‘N’ nucleotides). This is where you can find the missing data. These assembly gaps mostly contain repetitive elements such as interspersed repeats (transposable elements and endogenous viruses) and tandem repeats (microsatellites and satellites). These sequences are difficult to assemble because of their repetitive nature: “Like a puzzle piece occurring multiple times in a single puzzle game.”
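As a rough illustration of how scaffolds break down into contigs and gaps, the Python sketch below splits a scaffold sequence at runs of ‘N’ and reports how much of it is placeholder. The function name and the toy sequence are made up for this example. Keep in mind that gap lengths in real assemblies are often only estimates, so counting Ns gives an indication rather than an exact measure of the missing sequence.

```python
import re


def gap_summary(scaffold: str) -> dict:
    """Summarise the contigs and assembly gaps (runs of 'N') in one scaffold."""
    seq = scaffold.upper()
    # Each run of one or more Ns counts as a single assembly gap.
    gap_lengths = [m.end() - m.start() for m in re.finditer(r"N+", seq)]
    # Everything between the gaps is a contig; drop empty strings at the ends.
    contigs = [c for c in re.split(r"N+", seq) if c]
    gap_bp = sum(gap_lengths)
    return {
        "n_contigs": len(contigs),
        "n_gaps": len(gap_lengths),
        "gap_bp": gap_bp,
        "gap_fraction": gap_bp / len(seq) if seq else 0.0,
    }


# Toy scaffold: three 8 bp contigs joined by two 4 bp gaps (32 bp in total).
toy_scaffold = "ACGTACGT" + "N" * 4 + "GGCCTTAA" + "N" * 4 + "TTGACATG"
summary = gap_summary(toy_scaffold)
print(summary)
```

Running the same function over every scaffold in a FASTA file and summing the results would give the total gap content of an assembly.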


Long-Read Technologies

So, how can we deal with these gaps? New techniques that capture information across long stretches of DNA (up to millions of base pairs) might provide the solution. Think of long-read sequencing, as well as 10X Genomics Linked-Read sequencing, BioNano optical mapping and chromosome conformation capture (Hi-C). This is nicely illustrated by the recent developments in sequencing the chicken genome: comparing the short-read assembly (galGal4) with a more recent long-read assembly revealed that the amount of missing data decreased from 14.1 to 2.4 percent. We are not there yet, but we are getting there.


Two chicks exploring a map of their genome (from: http://www.wikipedia.com/)


Peona, V., Weissensteiner, M.H. & Suh, A. (2018) How complete are ‘complete’ genome assemblies? – An avian perspective. Molecular Ecology Resources. doi: 10.1111/1755-0998.12933
