I found the Li et al. paper, “Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly”, published in the August issue of Nature Biotechnology, worth reading for a number of reasons. As someone mainly interested in fungal and plant genomics, this paper is somewhat outside my research focus, but two things stood out: the novel approach to de novo genome assembly, and the emphasis on structural genome variation over single nucleotide polymorphisms (SNPs) in explaining genetic diversity.
The researchers worked with short-read Illumina sequence data from the genomes of two individuals, one of African descent (NA18507) and one of Asian descent (YH). As in many genome sequencing studies, the assembly process faced numerous problems: limited alignment accuracy, difficulty recovering long contiguous stretches of nucleotides, regions of low or no coverage, and the need to separate real signal from sequencing background noise. The authors tried to address these issues by developing a strategy based on de novo assembly instead of mapping reads to a reference genome.
The novel pipeline was able to identify homozygous structural variants (insertions, deletions, rearrangements, and inversions) in each of the assembled genomes, some of which were as long as roughly 23,000 base pairs. The researchers then validated the structural variations using both experimental and computational methods and, using data generated for the 1000 Genomes Project, looked for the same structural variations in the genomes of 106 other individuals.
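To make the idea of calling structural variants from an assembly more concrete, here is a minimal toy sketch in Python. It is not the pipeline from the paper: it assumes an assembled contig and the matching reference region have already been paired up and share identical flanking sequence, and the function names (`reverse_complement`, `call_simple_variant`) are my own, hypothetical choices. It trims the shared prefix and suffix and classifies whatever is left in the middle.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def call_simple_variant(reference, contig):
    """Classify the difference between a reference region and a contig.

    Toy assumption: both sequences cover the same locus and agree on
    their flanks, so at most one variant sits between the flanks.
    Returns (kind, position, reference_allele, contig_allele).
    """
    limit = min(len(reference), len(contig))

    # Length of the shared prefix.
    prefix = 0
    while prefix < limit and reference[prefix] == contig[prefix]:
        prefix += 1

    # Length of the shared suffix, not allowed to overlap the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and reference[-1 - suffix] == contig[-1 - suffix]):
        suffix += 1

    ref_mid = reference[prefix:len(reference) - suffix]
    ctg_mid = contig[prefix:len(contig) - suffix]

    if not ref_mid and not ctg_mid:
        kind = "identical"
    elif not ref_mid:
        kind = "insertion"
    elif not ctg_mid:
        kind = "deletion"
    elif ctg_mid == reverse_complement(ref_mid):
        kind = "inversion"
    else:
        kind = "complex"
    return kind, prefix, ref_mid, ctg_mid


reference = "ACGTACGTTTGACCA"
print(call_simple_variant(reference, "ACGTACGTGGGTTGACCA"))
print(call_simple_variant(reference, "ACGTACGTGACCA"))
```

Real variant callers have to deal with repeats, sequencing errors, heterozygosity, and ambiguous breakpoint placement, none of which this sketch attempts.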
SNPs are easier to observe (perhaps the reason why they have been emphasized so much in recent years?), but it seems that structural rearrangements are perhaps the major form of variation in human genomes and maybe in all genomes. Structural variations were less common than SNPs, but were more individual-specific and appeared to be associated with phenotypic characteristics. A natural next research direction would be to test for associations between structural variations and disease traits or susceptibility.
This paper also suggests that accurately assembling long genomic regions is very important to understanding structural variation. This can be accomplished either by using technologies that naturally generate longer reads (e.g. Sanger or PacBio sequencing) or by ensuring that short reads can be accurately assembled by computational methods.
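As a rough illustration of what assembling short reads computationally involves, here is a toy de Bruijn graph sketch in Python. De Bruijn graphs are one common approach to short-read assembly; I am not claiming this mirrors the paper's own assembler. The function names, the reads, and the choice of k are all made up for the example, and the code assumes error-free reads from a single strand.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Link each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def assemble_linear(reads, k):
    """Spell out one contig by walking an unbranched path in the graph.

    The walk stops at the first branch, dead end, or revisited node, so
    a repeat the k-mers cannot resolve ends the contig instead of being
    guessed at.
    """
    graph = build_de_bruijn(reads, k)
    successors = {nxt for targets in graph.values() for nxt in targets}
    starts = [node for node in graph if node not in successors]
    if len(starts) != 1:
        raise ValueError("toy example expects exactly one start node")

    node = starts[0]
    contig = node
    visited = {node}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in visited:
            break
        contig += nxt[-1]  # each step adds one new base
        visited.add(nxt)
        node = nxt
    return contig


# Five overlapping 6-base reads drawn from a made-up 13-base sequence.
reads = ["ATGGCG", "GGCGTG", "CGTGCA", "TGCAAT", "GCAATC"]
print(assemble_linear(reads, k=4))  # ATGGCGTGCAATC
```

With k=3 the same reads no longer give a unique path, because 2-base words such as "AT" and "TG" occur more than once in the sequence. That is a small-scale version of why repeats make long, accurate contigs hard to recover from short reads.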