I returned more than a week ago from Titus Brown’s two-week “Next Generation Sequence Data Analysis Workshop” at Michigan State University’s Kellogg Biological Station on Gull Lake in Hickory Corners. There is so much to say about this course that I can’t possibly cover it all here. I’m also still processing it all in my head and will probably be doing so for quite some time.
The course was a rigorous two-week workshop on current “next-generation” sequencing technologies and methods for data analysis. You could describe it as an advanced bioinformatics boot camp – exercises focused on text manipulation and data processing at the command line, Python scripting, and statistical analysis using the R language. You can find more information about the course, including all the tutorials, at Titus’s ANGUS website.
Titus, Ian Dworkin, and Istvan Albert instructed this intensive workshop – which sometimes consisted of ten hours of lectures and exercises a day. Guest speakers/lecturers included Corbin Jones (Sequencing technologies and data analysis), Erich Schwarz (Genome Assembly and WormBase), and Julian Catchen (RAD sequencing and the Stacks program for population genetics from next-generation sequencing data).
We used Amazon Web Services (AWS) to create virtual UNIX machines. This standardized our exercises and conveniently put users of all operating systems on an equal footing. Course participants could use the virtual machines for the workshop tutorials or to begin analysis of their own data; the only requirement was to bring a laptop to access the internet. Amazon provided a learning grant to the instructors so that all the course participants could use its Elastic Compute Cloud (EC2) service for tutorials and data analysis.
I can see lots of benefits to both teaching and conducting research using disk images housed in the cloud. One such benefit is that all the students in the course – as well as anyone, for example, who might want to replicate your publication data analyses – can use an identical disk image you provide. This eliminates any issues a user may experience from differences in operating systems, program dependencies, and processing capabilities.
The UNIX-based workshop participants (who were using Mac OS, Fedora, and Ubuntu) were able to ssh from their terminals, while the Windows users used PuTTY to log in to AWS. The command line was the focus of the course. Obviously this is the way that programmers and bioinformaticians do things, and this workshop was a confidence-building crash course at the command prompt. Relying on just the command line also helped reduce bandwidth use, as we had close to 30 laptops sharing the wireless connection in the room.
The Python workshop exercises were run in the newest version of IPython Notebook, which let us use the IPython toolkit as a browser-based notebook served remotely from our EC2 instances. While there were minor hiccups with the IPython Notebook platform (see Titus’s blog post), I was extremely impressed with the power of programming in Python from this interface, so much so that I immediately tried installing IPython Notebook on my own computer (and when I had some issues with matplotlib dependencies in my native installation, I opted for the educational Enthought version). Especially at my level of proficiency in Python, this platform is fantastic for its interactive inline visualization, the handiness and speed of debugging line by line, and its overall ease of use; it just makes me want to program in Python. I’ve already spent more time programming in Python in the short time since I returned from the course than I had before it.
The workshop also covered the statistical programming language R, which we used to analyze and plot RNA-Seq data from Illumina sequencing. We focused on using R at the command line, but I personally have used R in RStudio for a few months now and would recommend this interface. Alongside Python, the R language and all its great add-on packages are amazing for graphical representation of data.
We spent a couple of afternoons working on raw text manipulation using awk (see bioawk!), sed, and other UNIX commands. One of my favorite nights was when the instructors went to the computer to use tools they had never used before, with scripting and command line help from the other instructors.
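To give a flavor of that kind of column-oriented text manipulation, here is a minimal Python sketch of a filter one might otherwise write as an awk one-liner. The sample table, its two-column layout, and the function name `filter_rows` are all made up for illustration; none of it comes from the course tutorials.

```python
import csv
import io

# Hypothetical tab-delimited table: gene name, then read count.
SAMPLE = "geneA\t120\ngeneB\t8\ngeneC\t45\n"


def filter_rows(handle, min_count):
    """Yield (name, count) for rows whose second column is >= min_count."""
    # Roughly equivalent to: awk -F'\t' '$2 >= 10 {print $1, $2}'
    for name, count in csv.reader(handle, delimiter="\t"):
        count = int(count)
        if count >= min_count:
            yield name, count


# io.StringIO stands in for an open file so the example is self-contained.
rows = list(filter_rows(io.StringIO(SAMPLE), min_count=10))
for name, count in rows:
    print(name, count)
```

With a real file, the `io.StringIO(SAMPLE)` argument would simply be replaced by an open file handle.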
Probably the best aspect of the course was the intimacy and interaction among all the students, instructors, and teaching assistants. We were basically together as a group every day, from breakfast at 7 am to midnight talks by the campfire, for two solid weeks. All the participants in the course came from very different biological backgrounds, but we all had a desire to use sequencing data to address our research questions. We thought and laughed and debated and problem-solved and bonded with each other and played lots of volleyball. In all seriousness, it was two of the best weeks of my life.