Tag Archives: Comparative Genomics

The Draft Genome Of Watermelon: Citrullus lanatus

The Cucurbitaceae is an agriculturally important family of plants (think melons, pumpkins, cucumbers, squashes, etc.), and one of the most popular species in this family is watermelon.  Watermelon has been cultivated for more than 4,000 years and was most probably spread by nomadic people as a portable source of both water and pre-packaged nutrients.  The estimated center of diversity of the Cucurbits is in Southern Africa.  Watermelon has many cultivars – more than 200 in production worldwide – with a wide range of phenotypic diversity, and its production accounts for 7% of the land used to grow vegetables.


Unfortunately, Cucurbits are generally susceptible to pathogens, most typically bacterial and fungal ones.  The genomes in this group are starting to pile up, which makes the family an interesting group for comparative genomics studies – particularly in the development of model species for plant pathogen studies.

The recently published paper “The draft genome of watermelon (Citrullus lanatus) and resequencing of 20 diverse accessions” by Guo et al. in the journal Nature Genetics described the draft genome of the Citrullus lanatus East Asian cultivar 97103.  The authors then resequenced 20 different watermelon accessions – representing three different subspecies – in order to assess the genetic diversity within the species.

Almost 47 Gb of sequence data was generated using Illumina’s sequencing platforms, giving 108X coverage of the relatively small C. lanatus genome, which is estimated at 426 Mb.  The draft assembly is approximately 353 Mb, or 83.2% of the estimated genome size.  Unmapped reads, totaling almost 20% of the sequencing data, could not be accurately assembled into contigs because they derive from duplicated regions of the genome.
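As a quick sanity check on those numbers, here is a minimal sketch of the arithmetic behind fold coverage and the assembled fraction.  The function names are made up for illustration, and the inputs are the rounded figures quoted in this post, not the exact values from the paper.

```python
def fold_coverage(sequenced_bases: float, genome_size: float) -> float:
    """Average sequencing depth: total sequenced bases divided by genome size."""
    return sequenced_bases / genome_size


def assembled_fraction(assembly_size: float, genome_size: float) -> float:
    """Share of the estimated genome that is present in the draft assembly."""
    return assembly_size / genome_size


# Rounded figures quoted in the post (in base pairs).
SEQUENCED_BASES = 47e9    # "almost 47 Gb" of Illumina data
ESTIMATED_GENOME = 426e6  # estimated C. lanatus genome size
DRAFT_ASSEMBLY = 353e6    # size of the draft assembly

coverage = fold_coverage(SEQUENCED_BASES, ESTIMATED_GENOME)
fraction = assembled_fraction(DRAFT_ASSEMBLY, ESTIMATED_GENOME)

print(f"coverage ~ {coverage:.0f}X")
print(f"assembled ~ {fraction:.1%}")
```

With these rounded inputs the result is roughly 110X and roughly 83%, which is in line with the reported 108X and 83.2%; the small gap presumably comes from rounding in the figures quoted here.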

The authors estimated 23,440 genes in the watermelon genome – very close to both the cucumber genome (no surprise) and the human genome (surprise).  About 85% of the genes from watermelon could be predicted on the basis of homology to other plant genes.  The authors did a thorough assessment of transposable elements and various repeats, and classified functional RNAs ranging from ribosomal RNA subunits to microRNAs.  Like other plants, watermelon shows gene enrichment in subtelomeric regions.  On the basis of comparison to other genome sequences, watermelon possesses the seven paleotriplications shared with the eudicots.

The authors assessed genetic diversity across varieties of C. lanatus by sequencing 20 representative accessions at between 5X and 16X coverage.  The estimated diversity of these accessions was considerably lower than that of similar sets of accessions in maize, soybean, and rice.  This low level of genetic diversity is one explanation for the disease susceptibility of the Cucurbitaceae.  As a result, one objective of breeding programs for watermelon is to introduce more diversity from wild accessions.

Lastly, the authors assessed a number of key features of the C. lanatus genome (along with the other Cucurbitaceae): vascular transport of water and nutrients along vine-like stems, sugar content and accumulation, and the presence of an interesting non-essential amino acid – originally described from watermelons – called citrulline.

The watermelon genome database is located both here and here.

Galaxy Workshop and Community Conference, July 2012

Galaxy is a free web-based platform for bioinformatics and data mining initiated by some of my colleagues at Penn State’s Center for Bioinformatics and Genomics.  There are Galaxy platforms popping up all over the place, including at JGI and JCVI, and you can run it on your own desktop or computer cluster.  In case you’d like to gain hands-on experience using the platform or want to learn more about setting up your own Galaxy platform, you can attend the 2012 Galaxy Workshop and Community Conference:

The 2012 Galaxy Community Conference (GCC2012) will be held July 25-27 at the UIC Forum at University of Illinois Chicago.

GCC2012 will run for two full days, and be preceded by a full day of training workshops. GCC2012 will have things in common with previous meetings (see GDC 2010, GCC 2011), and will also incorporate new features, such as the training day, based on feedback we received after the 2011 conference.

GCC2012 is hosted by the University of Illinois at Chicago, the University of Illinois at Urbana-Champaign, and the Computation Institute.

The Medicago Genome Provides Insight Into the Evolution of Rhizobial Symbioses

Legumes are a very successful lineage of plants that have developed associations with soil microbes, most notably endosymbiotic nitrogen-fixing bacteria.  Nitrogen fixation takes place in specialized plant root structures called nodules.  The article “The Medicago genome provides insight into the evolution of rhizobial symbioses” by Young et al. was published online on November 16th in the journal Nature.  (Another paper concerning the Medicago genome recently appeared in the journal PNAS.)  Medicago truncatula, the plant sequenced in this paper, is related to the economically important crop alfalfa (Medicago sativa) and is a commonly used model plant for studying above- and below-ground plant biology, most notably interactions with symbiotic microorganisms.

The Medicago genome (like most genomes) is still in the draft stage.  Through the use of bacterial artificial chromosomes (BACs) and direct sequencing of genomic DNA, the researchers estimate the genome of Medicago to be upwards of 350 Mb in length.  As an estimate of the completeness of the M. truncatula genome, approximately 94% of expressed genes (as ESTs) map to the draft genome.  The estimated number of genes for M. truncatula is 62,388, with an average gene size of 2,211 base pairs and an average of 4 exons per gene.  These numbers seem to be in the same “ballpark” as, or perhaps larger than, those for the genomes of Poplar, Rice, and Arabidopsis.

The sequencing of numerous plant genomes, including M. truncatula here, indicates a whole genome duplication event which occurred prior to the split of the rosids from the asterids approximately 150 million years ago.  Another whole genome duplication event occurred approximately 60 million years ago in the Legumes, which went on to yield several subclades, with Medicago being placed in the Hologalegina clade.

Significant synteny is shared between Medicago and the genomes of other sequenced legumes, Glycine max and Lotus japonicus.  A common ancestor of the legumes underwent a whole genome duplication event, occurring approximately 58 million years ago, and as a result, specific euchromatic regions of Medicago share synteny with numerous regions in each of the Lotus and Glycine genomes, as well as other regions of the Medicago genome.  Additionally, due to a pre-Rosid whole genome duplication event, the genome of Medicago shows synteny to the grape genome in at least three elongated regions.

There has been a high rate of local gene duplication events – some by tandem duplication – in the Medicago genome; this rate is approximately threefold higher than in Glycine and one and a half times higher than in both Populus and Arabidopsis.  Gene duplication events in Medicago could explain the average to above-average number of genes observed in the genome.  Based on the estimated time of origin for the legumes, Medicago has undergone synonymous substitutions at almost twice the average rate for vascular plants.
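For readers who want to see how a rate like this is typically derived, here is a minimal sketch of one common approach: divide the synonymous divergence (Ks) between two gene copies by twice the time since they diverged, since both copies accumulate changes.  This is an illustration under my own assumptions and not necessarily the calculation used in the paper; the function name and the Ks value are placeholders, and only the 58 million year figure comes from the post.

```python
def synonymous_rate(ks: float, divergence_time_years: float) -> float:
    """Synonymous substitutions per site per year.

    Both copies accumulate substitutions independently after they
    diverge, so the pairwise divergence Ks is spread over 2 * T years.
    """
    if divergence_time_years <= 0:
        raise ValueError("divergence time must be positive")
    return ks / (2.0 * divergence_time_years)


# Placeholder Ks between duplicated genes; NOT a value from the paper.
PLACEHOLDER_KS = 1.0
# Approximate age of the legume whole genome duplication quoted in the post.
LEGUME_WGD_YEARS = 58e6

rate = synonymous_rate(PLACEHOLDER_KS, LEGUME_WGD_YEARS)
print(f"{rate:.2e} synonymous substitutions per site per year")
```

Comparing the rate obtained this way for Medicago with the same quantity for other lineages is what allows a statement like "almost twice the average rate".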

Production of a specialized organ, the root nodule, in many members of the legumes is a trait of both ecological importance and human agricultural interest.  Within the root nodule, leguminous plants harbor rhizobial bacteria which are capable of fixing atmospheric nitrogen.  It appears that the trait of nodulation has evolved numerous times in the Fabales, and was reliant on whole genome duplication events which allowed the emergence of novel gene functions from redundant genes.

There are numerous plant genomic features present in the Legumes related to signaling with symbiotic soil microorganisms, such as nitrogen-fixing bacteria and mycorrhizal fungi.  Duplicated genes have evolved roles in nodule formation (the genes NFP and ERN1) and mycorrhizal colonization (the genes LYR1 and ERN2).  The researchers used RNA-Seq data from six different plant organs to compare the expression of putative paralogs derived from whole genome duplication.  Not surprisingly for Medicago, roots had the highest amount of differential expression of paralogous genes, followed by flower, nodule, leaf, seed, and flower bud.  Transcription factors, putatively responsible for tissue-specific differences in gene expression, were estimated to make up 6% of all Medicago genes.

The Genomes of Two Thermophilic and Biomass-Degrading Fungi, Thielavia terrestris and Myceliophthora thermophila

One of the hurdles to the production of cellulosic biofuel is the economical breakdown of plant biomass.  Currently, fungi used to break down plant biomass operate at, or slightly above, room temperature.  Chemical reactions at room temperature proceed slowly, are less efficient, and may be riddled with contaminating fungi which lower the efficiency of the breakdown process.  One scientific goal is to raise the temperature in bioreactors in the hope of speeding up degradation by using efficient fungal enzymes that operate at higher temperatures.

In an effort to find thermostable fungal degradative enzymes, researchers have sequenced the genomes of two fungi, Thielavia terrestris and Myceliophthora thermophila, known for their ability to survive at high temperatures, namely 40°C to 75°C.  A report entitled “Comparative Genomic Analysis of the Thermophilic Biomass-Degrading Fungi Myceliophthora thermophila and Thielavia terrestris” was published online on October 2nd in the journal Nature Biotechnology.

The 38.7 Mbp genome of M. thermophila and the 36.9 Mbp genome of T. terrestris are the first genomes of thermophilic eukaryotes to be sequenced, and contain seven and six complete chromosomes, respectively.  The genome of M. thermophila contains 9,110 protein-coding genes and there are 9,813 such genes in the genome of T. terrestris.  Both filamentous Ascomycetes – placed in the class Sordariomycetes and family Chaetomiaceae – have a similar level of genomic organization, barring numerous translocations and inversions.  When considering the three species with sequenced genomes in the Chaetomiaceae, large portions of the genomes, some of which are greater than 6000 contiguous genes, are shared in syntenic blocks.

Enzymes for the breakdown of plant matter – which can include a wide array of materials from agricultural and forestry waste, recycled pulp and paper products, leaves, etc. – were discovered across the genomes of both T. terrestris and M. thermophila.  These include numerous carbohydrate-active enzymes (CAZymes) in the glycoside hydrolase, polysaccharide lyase, carbohydrate esterase, and glycosyl transferase families.  Apart from some slight differences in the breakdown of specific plant polysaccharides, such as pectin, both fungi can be categorized as general decomposers with regard to their enzyme repertoire.

The researchers then tested the expression of some enzymes identified in these newly sequenced fungal genomes, and compared their diversity to well-characterized enzymes from Trichoderma reesei.  Unlike T. reesei, both M. thermophila and T. terrestris show an expansion of the GH61 enzyme family, responsible for the degradation of plant cell wall polysaccharides, as well as of the GH10 and GH11 xylanase gene families.  The researchers used RNA-Seq to compare the expression of these enzymes on differing plant materials, such as alfalfa and barley straw, which represented characteristic dicot and monocot plants, respectively.  While there are noticeable differences in the degradation of plant material from dicots and monocots by both T. terrestris and M. thermophila, orthologs from both fungal genomes show similar patterns of gene expression, particularly when growing on complex plant substrates.

Research commentaries on this publication can be found here and here.

7th Annual Joint Genome Institute Users Meeting 2012

Recently announced, the Joint Genome Institute – US Department of Energy is planning to have their annual meeting in Walnut Creek, California, during the dates of March 20th to 22nd.  Registration is now open.  This should be another great meeting and includes another impressive array of speakers.

Genome Sequence of the Date Palm

Published in the June 2011 issue of the journal Nature Biotechnology was a paper reporting on the genome sequence of the date palm, Phoenix dactylifera.  This paper, authored by Al-Dous et al., addressed the genome sequencing and de novo assembly of this agriculturally important monocot tree, along with comparative genomics with other plants.

Dates have been found in the tombs of pharaohs estimated at 8,000 years old.  Fields of agriculturally planted trees, estimated to be older than 5,000 years, suggest the date palm is one of the oldest cultivated plants in the world.  Dates are the most important agricultural crop in the hot and arid regions surrounding the Arabian Gulf and their global production is close to 7 million tons yearly.

Despite this long history of cultivation, there are a few problems to deal with if you are a date grower.  As is typical of tree crops, there is a long generation time from seedling to fruit harvesting.  Additionally, only the female date palm provides fruit, and it takes at least 5 years after seed germination to tell if you have a male or female plant.  To make it even harder for a date grower, there are more than 2000 date varieties, each exhibiting its own color, flavor, size, shape, and ripening schedule, and they are all really hard to keep track of using conventional techniques.

In an effort to provide genetic resources for date growers and breeders, the authors of this study – who were mainly located in Qatar – sequenced and assembled 380 Mb of the estimated 658 Mb genome of the Khalas cultivar, which is known for high fruit quality.  Generated using short reads from the Illumina Genome Analyzer IIx platform, this partial sequence excludes numerous large repeated regions, includes a predicted 28,890 genes, and represents 18 pairs of chromosomes.  The authors estimate that this draft genome represents roughly 90% of the total genes and 60% of the total genome.

This genome resource also serves a comparative genomics purpose, as the date palm is the first sequenced member of the widespread monocot order Arecales.  To date, the only other Monocots with sequenced genomes – for example: Corn, Rice, and Sorghum – have all been in the grass order, the Poales.

This report is missing some vital information: in addition to an incomplete genome assembly, there is no metabolic, developmental, or gene network pathway reconstruction for the date palm provided in this paper (and unfortunately this paper also includes some glaring typos in the citation section).  In place of these expected analyses, the authors conducted a thorough survey of SNPs in this Khalas cultivar, along with eight additional cultivars common in breeding programs for the date palm.  Within these nine cultivars, 3,518,029 SNPs were identified, but quite interestingly, a total of just 32 SNPs could be used to differentiate the cultivars.
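To give a feel for how a handful of markers can separate a set of cultivars, here is a minimal sketch of a greedy selection: keep picking the SNP that tells apart the most cultivar pairs that are still indistinguishable.  This is my own illustration and not the authors’ method; the cultivar names, SNP names, and genotypes below are invented toy data.

```python
from itertools import combinations


def pick_discriminating_snps(genotypes):
    """Greedily pick SNPs until every pair of cultivars differs at one of them.

    genotypes maps cultivar name -> {snp id -> genotype call}.
    """
    cultivars = sorted(genotypes)
    snps = sorted(next(iter(genotypes.values())))
    unresolved = set(combinations(cultivars, 2))
    chosen = []

    while unresolved:
        def resolved_by(snp):
            # Pairs still unresolved that have different calls at this SNP.
            return {(a, b) for a, b in unresolved
                    if genotypes[a][snp] != genotypes[b][snp]}

        best = max(snps, key=lambda s: len(resolved_by(s)))
        gained = resolved_by(best)
        if not gained:
            raise ValueError("some cultivars are identical at every SNP")
        chosen.append(best)
        unresolved -= gained

    return chosen


# Invented toy genotypes for four hypothetical cultivars.
toy_genotypes = {
    "cultivar_A": {"snp1": "AA", "snp2": "CC", "snp3": "GG"},
    "cultivar_B": {"snp1": "AA", "snp2": "TT", "snp3": "GG"},
    "cultivar_C": {"snp1": "GG", "snp2": "CC", "snp3": "GG"},
    "cultivar_D": {"snp1": "GG", "snp2": "TT", "snp3": "GG"},
}

panel = pick_discriminating_snps(toy_genotypes)
print(panel)  # prints ['snp1', 'snp2']
```

A greedy choice like this is not guaranteed to find the smallest possible panel, but it shows why a few informative markers out of millions can be enough for identification.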

In addition to the thorough SNP analysis, the researchers did a full parentage analysis of the cultivars used in this study, which include famous date varieties such as Deglet Noor, Dayri, and Medjool.  Here’s an article in Nature Middle East on the importance of understanding this parentage and gender analysis.

Although this is a draft genome that is still being completed and resequenced, the tools provided by the authors, namely the SNP and parentage analyses, should provide date palm breeders with many resources for improving fruit quality, and this genome represents an exciting piece of the monocot evolutionary puzzle.

Horizontal Gene Transfer In Ascomycete Fungi

Horizontal Gene Transfer (HGT) goes against what we typically consider the normal transfer of genetic material from parent to offspring.  HGT involves the transfer of genetic material from one organism to another that is not its offspring.  Among the bacteria, which readily take up foreign DNA through transformation, conjugation, and transduction, there is a fair amount of HGT.  HGT events have rarely been observed in Eukaryotes because numerous barriers exist to prevent foreign nucleotides from entering a cell’s nucleus.  In the Fungi, these barriers include a substantial cell wall made of chitin, multiple cell and nuclear membranes to cross, and a mode of feeding in which metabolic enzymes are secreted outside the cells and the released nutrients are then absorbed.  Despite these barriers, there is now evidence of multiple occurrences of HGT in the fungi.

In a recent article published in the journal Current Biology, Jason Slot and Antonis Rokas, both of Vanderbilt University, provided evidence of HGT between two Ascomycete clades.  In this study, the authors identified a 23-gene cluster from the genus Aspergillus which was transferred to the genus Podospora.  Genes in this cluster synthesize the toxic compound sterigmatocystin, a precursor to the aflatoxins, which are noted for their production in Aspergillus.  Both genera are located in the subphylum Pezizomycotina, so the two clades are not distantly related, but the HGT event was supported by several different methods.

While it’s easy to observe genetic material passed from generation to generation, recognizing HGT is a little more difficult.  The main way the researchers identified HGT was by using phylogenetic methods to find gene clusters whose similarity cannot be explained by vertical descent alone.

Thomas Richards points out in his commentary on the Slot & Rokas paper (also in Current Biology) that because fungi do not phagotrophically consume their food, they are less likely to incur HGT events.  There are two notable hypotheses as to why we do see HGT in the fungi.  First, many secondary pathway genes in Eukaryotes are encoded in gene clusters, and the fungi have a fair number of these clusters.  Gene clusters, which are more functional in a natural selection sense, are therefore more likely than individual genes to persist after transfer.  Data from HGT studies in fungi support this hypothesis.  Second, by virtue of their biology and natural history, fungi are intimately tied to other organisms, and fulfill roles as saprobes, pathogens, or symbionts.  This intimacy increases the opportunity for genes to transfer from one organism to another.  Data suggest that this hypothesis is also true, as many of the recorded instances of HGT in fungi have been observed in organisms with overlapping environments.