Tag Archives: Next-Generation Sequencing

A Genome Sequence for Tomato

The average person in the United States eats more than 10 kilograms of tomatoes a year – underscoring the fact that the fruit is one of the most important plant crops in cultivation.  To help improve taste, texture, and disease resistance – just to name a few traits – a large consortium of researchers has sequenced and published a draft tomato genome.  In fact, the research consortium has published genome sequences for two tomatoes: the domesticated inbred Solanum lycopersicum strain Heinz 1706 – the variety famous for ketchup – and its wild Peruvian ancestor, Solanum pimpinellifolium.

The consortium published the draft genome sequences with a paper entitled “The tomato genome sequence provides insights into fleshy fruit evolution” in the journal Nature.  The consortium officially started sequencing the genome in 2003, but heterozygosity and duplication events made assembling the genome difficult.  The tomato genome is approximately 900 Mb – smaller than the human genome, but certainly not small by eukaryotic standards.  Genetically and phenotypically diverse, the genus Solanum is one of the largest in the angiosperms.

The genomes of Solanum lycopersicum and S. pimpinellifolium show only 0.6% divergence, and there is evidence of recent hybridization between the two species.  Both species show approximately 8% genome divergence when compared with their close relative, the potato, Solanum tuberosum.  Across the genus Solanum there have been two genome triplications with subsequent gene loss: one triplication is ancient and shared with the rosid clade, and the other is shared within the Solanaceae, whose genomes appear to be highly syntenic across the family.  The genomes were completed with both Sanger- and Illumina-derived sequences and assembled with the help of physical and genetic maps developed from a long history of tomato breeding efforts.
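For readers wondering what a figure like "0.6% divergence" means in practice, here is a minimal Python sketch of the simplest version of the idea: the percentage of mismatched bases between two aligned sequences.  This is only an illustration, not the consortium's method.  The function name percent_divergence and the toy sequences are my own inventions, and a real genome comparison involves alignment, filtering, and indel handling well beyond this.

```python
def percent_divergence(seq_a, seq_b):
    """Percent of mismatched bases between two pre-aligned sequences.

    Hypothetical helper for illustration only. Columns containing a gap
    ('-') or an ambiguous base ('N') in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a != b:
            mismatches += 1

    if compared == 0:
        raise ValueError("no comparable positions in the alignment")
    return 100.0 * mismatches / compared


# Toy aligned sequences (made up): one mismatch in ten compared positions.
toy_a = "ACGTACGTAC"
toy_b = "ACGTACGTAT"
print(percent_divergence(toy_a, toy_b))
```

On this scale, 0.6% divergence corresponds to roughly six differing bases per thousand aligned positions.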

There are 34,727 and 35,004 genes identified in the genomes of Solanum lycopersicum and S. pimpinellifolium, respectively.  These findings are similar to those for other plant genomes, as 8,615 of these genes are common to tomato, potato, rice, grape, and Arabidopsis.  Expression was assessed by replicated RNA-Seq of root, leaf, flower, and fruit tissues.  A total of 18,320 orthologous gene pairs were found in tomato and potato, indicating diversifying selection between the two species of Solanum.

The consortium specifically compared tomato to grape in this study, as grape and tomato shared a common ancestor approximately 100 million years ago, before the first whole genome triplication event that preceded the rosid-asterid divergence.  Additionally, grape and tomato have similar molecular fruit maturation mechanisms.  When comparing the genomes of tomato and grape, approximately 73% of gene models are orthologous.  By dating the genome triplication events, the researchers conclude that the triplication within the Solanaceae occurred roughly 71 million years ago, well before the tomato-potato divergence approximately 7 million years ago.

Having a draft genome sequence is an important step toward understanding the molecular biology of the tomato plant.  Genome duplication events gave rise to the diversification of genes responsible for enhanced fruit physiological and chemical development – such as lycopene synthesis – including photoreceptors and transcription factors that influence fruit ripening.  Additionally, tomato has had a contraction in the number of gene families associated with the synthesis of toxic alkaloids – the chemical hallmarks of many members of the Solanaceae.  One interesting question not answered by this research is the genomic mechanism by which the tomato regulates nutrient investment in above-ground fruits while the potato regulates starch investment in below-ground tubers.

These two tomato genomes, along with the genomes of fellow nightshades completed or in the works (potato, pepper, tobacco, petunia, eggplant, etc.), will help breeders to develop traits desired by producers, like long shelf life, and fruit quality traits desired by tomato consumers, such as taste, color, and texture.  In addition to these benefits, the draft tomato genomes will provide insights into the biology and nutrition of the Solanaceous plants, and provide more information for comparative genomics within this economically important group of plants.

First International Conference of Genomics in the Americas, Philadelphia, 2012

I haven’t been sure what to make of the BGI/ICG series of meetings thus far (see here and here), but at least the first foray into the Americas – The First International Conference of Genomics in the Americas, to be held September 27th to 28th, 2012, in Philadelphia – looks interesting.  There is a good group of speakers lined up for this meeting, even though BGI has rightfully taken some criticism for skewing the speaker list a little toward the male side of the genomics spectrum.  Registration is open now.

MSU Next-Generation Sequencing Analysis Workshop (Computer Summer Camp) 2012

I returned more than a week ago from Titus Brown’s two-week “Next Generation Sequence Data Analysis Workshop” at Michigan State University’s Kellogg Biological Station on Gull Lake in Hickory Corners.  There’s really so much to say about this course that I can’t possibly cover it all here.  I’m also still processing it all in my head and will probably be doing so for quite some time.

The course was a rigorous two-week workshop on current “next-generation” sequencing technologies and methods for data analysis.  You could describe it as an advanced bioinformatics boot camp – exercises focused on text manipulation and data processing at the command line, Python scripting, and statistical analysis using the R language.  You can find more information about the course, including all the tutorials, at Titus’s ANGUS website.

Titus, Ian Dworkin, and Istvan Albert instructed this intensive workshop – which sometimes consisted of ten hours of lectures and exercises a day.  Guest speakers/lecturers included Corbin Jones (Sequencing technologies and data analysis), Erich Schwarz (Genome Assembly and WormBase), and Julian Catchen (RAD sequencing and the Stacks program for population genetics from next-generation sequencing data).

We used Amazon Web Services (AWS) to create virtual UNIX machines.  This standardized our exercises and also conveniently made the course equal opportunity for users of all operating systems.  Course participants could use the virtual machines for the workshop tutorials or to begin analysis of their own data – the only requirement was to bring a laptop to access the internet.  Amazon provided a learning grant to the instructors so that all the course participants could use its Elastic Compute Cloud (EC2) service for tutorials and data analysis.

I can see lots of benefits to both teaching and conducting research using disk images housed in the cloud.  One such benefit is that all the students in the course – as well as anyone, for example, who might want to replicate your publication data analyses – can use an identical disk image you provide.  This eliminates any issues a user may experience from differences in operating systems, program dependencies, and processing capabilities.

The UNIX–based workshop participants (who were using Mac OS, Fedora, and Ubuntu) were able to ssh from their terminals, while the Windows users used PuTTY to log in to AWS.  The command line was the focus of the course.  This is the way that programmers and bioinformaticians do things, and this workshop was a confidence-building crash course at the command prompt.  Relying on just the command line also helped reduce bandwidth use, as we had close to 30 laptops on the wireless connection in the room.

The Python workshop exercises were implemented using the newest version of IPython Notebook, a platform for using the IPython toolkit as a browser-based notebook served remotely from our EC2 instances.  While there were minor hiccups with the IPython Notebook platform (see Titus’s blog post), I was extremely impressed with the power of programming in Python from this interface, so much so that I immediately tried installing IPython Notebook on my computer (and when I had some issues with matplotlib dependencies in my native installation, I opted for the educational Enthought version).  Especially at my level of proficiency with Python, this platform is fantastic for the interactive in-line visualization, the handiness and speed of debugging line by line, and the overall ease of use – it just makes me want to program in Python.  I’ve already spent more time programming in Python in the short time since I returned from the course than I did before the course.

The workshop also focused on the statistical programming language R to analyze and plot RNA-Seq data from Illumina sequencing.  We focused on using R at the command line, but I personally have used R in the RStudio platform for a few months now and would recommend this interface.  Like Python, the R language and all its great add-on packages are amazing for graphical representation of data.

We spent a couple of afternoons working on raw text manipulation using awk (see bioawk!), sed, and other UNIX commands.  One of my favorite nights was when the instructors went to the computer to use tools they had never used before, with scripting and command line help from the other instructors.
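To give a flavor of this kind of raw text manipulation, here is a minimal Python sketch that summarizes a FASTQ file, the sort of task we often handled with awk one-liners.  It is not taken from the course materials.  The function names read_fastq and summarize and the two toy reads are made up for illustration, and the parser assumes the simple four-lines-per-record FASTQ layout with no blank lines.

```python
from io import StringIO


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a FASTQ stream.

    Assumes exactly four lines per record (header, sequence, '+', quality).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def summarize(handle):
    """Return read count, mean read length, and GC fraction."""
    reads = 0
    total_bases = 0
    gc_bases = 0
    for _name, seq, _qual in read_fastq(handle):
        reads += 1
        total_bases += len(seq)
        gc_bases += sum(1 for base in seq.upper() if base in "GC")
    return {
        "reads": reads,
        "mean_length": total_bases / reads if reads else 0.0,
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
    }


# Two made-up reads standing in for a real file.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGCCAA\n+\nIIIIII\n"
print(summarize(StringIO(toy_fastq)))
```

For a real file, you would pass an open file handle to summarize instead of the StringIO object.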

Probably the best aspect of the course was the closeness and interaction among all the students, instructors, and teaching assistants.  We were basically together as a group every day – from breakfast at 7 am to midnight talks by the campfire – for two solid weeks.  The participants in the course came from very different biological backgrounds, but we all had a desire to use sequencing data to address our research questions.  We thought and laughed and debated and problem-solved and bonded with each other and played lots of volleyball.  In all seriousness, these were two of the best weeks of my life.

UPDATE: Check out fellow workshop participant Wayne Decatur’s daily blog (and see also Proteopedia) on the workshop for more info!

Galaxy Workshop and Community Conference, July 2012

Galaxy is a free web-based platform for bioinformatics and data mining initiated by some of my colleagues at Penn State’s Center for Bioinformatics and Genomics.  There are Galaxy platforms popping up all over the place, including at JGI and JCVI, and you can also run it on your own desktop or computer cluster.  If you’d like to gain hands-on experience using the platform or want to learn more about setting up your own Galaxy platform, you can attend the 2012 Galaxy Workshop and Community Conference:

The 2012 Galaxy Community Conference (GCC2012) will be held July 25-27 at the UIC Forum at University of Illinois Chicago.

GCC2012 will run for two full days, and be preceded by a full day of training workshops. GCC2012 will have things in common with previous meetings (see GDC 2010, GCC 2011), and will also incorporate new features, such as the training day, based on feedback we received after the 2011 conference.

GCC2012 is hosted by the University of Illinois at Chicago, the University of Illinois at Urbana-Champaign, and the Computation Institute.