Tag Archives: Data Analysis

Summer 2013 Bioinformatics Workshop Roundup Part Two

Here are a couple more promising bioinformatics workshops taking place in the summer of 2013:

Metagenomics: From The Bench To Data Analysis, Heidelberg, Germany, April 14th to April 20th, 2013


Joint EU-US Training in Marine Bioinformatics, Newark, Delaware, USA, June 16th to June 29th, 2013


Summer 2013 Bioinformatics Workshop Roundup Part One

The summer is a great time to learn new skills and really hone your data analysis techniques.  I think some topics — bioinformatic tools and data analysis scripting in particular — are best learned in intense multi-day workshops or one- to two-week short courses.  Here are a few courses being held this summer that may be of interest to you.  I’ll be sure to post more as I hear about them.

Programming for Evolutionary Biology, Leipzig, Germany, April 3rd to April 19th, 2013


Informatics for RNA-sequence Analysis, Toronto, Canada, June 3rd to June 4th, 2013


Pathway & Network Analysis of -Omics Data, Toronto, Canada, June 10th to June 12th, 2013


Adding Dropbox To Remote Machines At The Command Line

When I was recently at Titus Brown’s Next-Generation Sequencing and Data Analysis Workshop we were using remote computers at the command line for our workshop exercises.  In the workshop we used Dropbox to transfer files from one computer to another.  I had no idea that I could easily install and utilize Dropbox at the command line on remote machines.  I’ve reproduced the tutorial from the workshop here with some minor changes of my own.

Knowing how to connect Dropbox to remote machines has saved me time transferring files, and it’s been extremely helpful on many different levels.  I can quickly pipe or send output text or images right to numerous shared devices.  I can check on the progress of a pipeline running on a remote server from my mobile phone by looking at output images or files (or even quickly checking file sizes).  Visualization at the command line is non-existent, so if I want to see an output figure, I can look at data output graphs quickly from Dropbox and, if I choose, put images in a shared folder for a colleague to inspect in a matter of seconds.  Before you use Dropbox at the command line, you’ll have to set up a Dropbox account.

To link Dropbox at the command line on your home computer or, perhaps more importantly, on a remote machine, start in the directory where you want to put your Dropbox folder, such as your home directory.

$ cd

Next, you’ll want to download Dropbox (here, for Linux-based machines):

$ wget -O dropbox.tar.gz "http://www.dropbox.com/download/?plat=lnx.x86_64"

…and extract the gzipped tarball:

$ tar -xvzf dropbox.tar.gz

Next, you should run the dropboxd program:

$ ~/.dropbox-dist/dropboxd &

After running the program, you should see a message like this:

This client is not linked to any account... Please visit https://www.dropbox.com/cli_link?host_id=XXXXX to link this machine.

Next, copy and paste that URL into your Web browser and log into Dropbox.  BAM!  The folder ~/Dropbox in your home directory is now linked to your account!

From your browser you can also manage which computers are linked to your Dropbox account under the “My Computers” tab in the Account Settings option.
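Once the folder is linked, sharing output from a long-running job becomes a one-liner.  Here’s a minimal sketch of the kind of thing I mean; the subdirectory and file names are invented for illustration:

```shell
# Assumes ~/Dropbox is the linked folder; DROPBOX_DIR and all file
# names below are placeholders for illustration.
DROPBOX_DIR="${DROPBOX_DIR:-$HOME/Dropbox}"
mkdir -p "$DROPBOX_DIR/run_logs"

# Append pipeline progress to a synced log you can check from your phone...
echo "assembly step 3/10 finished at $(date)" >> "$DROPBOX_DIR/run_logs/progress.txt"

# ...and drop an output figure into the shared folder for a colleague.
cp output_plot.png "$DROPBOX_DIR/run_logs/" 2>/dev/null || true
```

Anything written under the linked folder syncs automatically once dropboxd is running, so there’s no extra upload step.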

I prefer to use Dropbox and most of my colleagues do too, but I’m sure you can do this with other file sharing platforms such as Google Drive or SugarSync.

MSU Next-Generation Sequencing Analysis Workshop (Computer Summer Camp) 2012

I returned more than a week ago from Titus Brown’s two-week “Next Generation Sequence Data Analysis Workshop” at Michigan State University’s Kellogg Biological Station on Gull Lake in Hickory Corners.  There’s so much to say about this course that I can’t possibly cover it all here.  I’m also still processing it all in my head and will probably be doing so for quite some time.

The course was a rigorous two-week workshop on current “next-generation” sequencing technologies and methods for data analysis.  You could describe it as an advanced bioinformatics boot camp – exercises focused on text manipulation and data processing at the command line, Python scripting, and statistical analysis using the R language.  You can find more information about the course, including all the tutorials, at Titus’s ANGUS website.

Titus, Ian Dworkin, and Istvan Albert instructed this intensive workshop – which sometimes consisted of ten hours of lectures and exercises a day.  Guest speakers/lecturers included Corbin Jones (Sequencing technologies and data analysis), Erich Schwarz (Genome Assembly and WormBase), and Julian Catchen (RAD sequencing and the Stacks program for population genetics from next-generation sequencing data).

We used Amazon Web Services (AWS) to create virtual UNIX machines.  This standardized our exercises and conveniently made the course accessible to users of all operating systems.  Course participants could use the virtual machines for the workshop tutorials or to begin analysis of their own data – the only requirement was to bring a laptop to access the internet.  Amazon provided a learning grant to the instructors so that all the course participants could use the Elastic Cloud Computing (EC2) service for tutorials and data analysis.

I can see lots of benefits to both teaching and conducting research using disk images housed in the cloud.  One such benefit is that all the students in the course – as well as anyone, for example, who might want to replicate your publication data analyses – can use an identical disk image you provide.  This eliminates any issues a user may experience from differences in operating systems, program dependencies, and processing capabilities.

The UNIX-based workshop participants (who were using Mac OS, Fedora, and Ubuntu) were able to ssh using their terminals, while the Windows users used PuTTY to log into AWS.  The command line was the focus of the course – obviously this is the way that programmers and bioinformaticians do things, and this workshop was a confidence-building crash course at the command prompt.  Relying on just the command line also helped reduce bandwidth, as we had close to 30 laptops using the wireless connection in the room.

The Python workshop exercises used the newest version of the IPython Notebook, a platform for running the IPython toolkit as a browser-based notebook remotely from our EC2 instances.  While there were minor hiccups with the IPython Notebook platform (see Titus’s blog post), I was extremely impressed with the power of programming in Python from this interface – so much so that I immediately tried installing IPython Notebook on my own computer (and when I had some issues with matplotlib dependencies in my native installation, I opted for the educational Enthought distribution).  Especially at my level of proficiency with Python, this platform is fantastic for its interactive in-line visualization, the handiness and speed of debugging line by line, and the overall ease of use – it just makes me want to program in Python.  In the short time since I returned from the course, I’ve already spent more time programming in Python than I had before it.

The workshop also focused on the statistical programming language R to analyze and plot RNA-Seq data from Illumina sequencing.  We used R at the command line, but I personally have used R in the RStudio platform for a few months now and would recommend that interface.  In addition to Python, the R language and its great add-on packages are amazing for graphical representation of data.

We spent a couple of afternoons working on raw text manipulation using awk (see bioawk!), sed, and other UNIX commands.  One of my favorite nights was when the instructors went to the computer to use tools they had never used before, with scripting and command line help from the other instructors.
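To give a flavor of those text-manipulation sessions, here is a hedged sketch of a couple of typical one-liners; the toy FASTA file below is invented for illustration:

```shell
# A toy FASTA file stands in for real sequencing output.
printf '>seq1 sample_A\nACGTACGT\n>seq2 sample_B\nTTGACCAA\n' > demo.fa

# awk: count the sequences (records whose line starts with ">")
awk '/^>/ {n++} END {print n, "sequences"}' demo.fa

# sed: pull just the sequence IDs out of the header lines
sed -n 's/^>\([^ ]*\).*/\1/p' demo.fa
```

The awk command prints “2 sequences” and the sed command prints the two IDs, seq1 and seq2 – the same patterns scale unchanged to files with millions of reads.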

Probably the best aspect of the course was the intimacy and interactions of all the students, instructors, and teaching assistants.  We were basically together as a group every day – from breakfast at 7 am to midnight talks by the campfire – for two solid weeks.  All the participants in the course came from such different biological backgrounds, but we all had a desire to use sequencing data to address our research questions.  We thought and laughed and debated and problem-solved and bonded with each other and played lots of volleyball.  It, with all seriousness, was two of the best weeks of my life.

UPDATE: Check out fellow workshop participant Wayne Decatur‘s daily blog (and see also Proteopedia) on the workshop for more info!

Book Review: Practical Computing For Biologists

I have a couple of book reviews in the pipeline, so I am starting a new category for reviews of books I find useful (or not so useful).  I wrote this review of this great book months ago, but, like many things in my life, I’m just now getting it online.

Like many people, my research has been changing in recent years.  I have been spending an increasing amount of time in front of a computer and less time at the lab bench.  I can’t see myself ever forsaking the wet lab or field experiments, but I’m using computers more than ever before.  There’s now so much data to process – mostly text in the form of sequence data – and I’ve become increasingly reliant on a computer to search large data sets and convert data file formats.  Even if you aren’t a biologist in the area of genomics/genetics, new data collection instruments for physiology, ecology, and atmospheric sciences are recording data at incredible rates, and sorting through citations is getting more and more time-consuming.  It’s impossible to ignore the data revolution that is taking place, no matter where your foundation within the biological sciences (or physics, chemistry, etc.) lies.

I wish the book Practical Computing For Biologists (and Companion Website), by Steven H. D. Haddock & Casey W. Dunn, had come along sooner, but I am so glad it’s available now: learning to deal with data more efficiently is exactly where this book comes in.  Considering my research and use of time, this has been the most important book I’ve read in the last year, perhaps the last decade.  If you’re a biologist (or anyone, for that matter) who finds yourself clicking away at a spreadsheet (such as Excel) or cutting and pasting from online data repositories (such as GenBank, national weather databases, etc.), then this book is for you.  In reality, this book is for anyone who wants to use a computer to work more efficiently with data.

The book can be broken up into six sections dedicated to the following topics: (1) manipulating and searching text files, (2) working within your computer’s shell, (3) basic programming for biologists, (4) combining methods (this is a section on database management and tool selection), (5) dealing with graphics for data communication, and (6) advanced topics such as remote computer access and installing software.

This book devotes a large portion, and rightfully so, to how to manipulate text files and other file formats used to store and communicate data.  Beginning with text editing using regular expressions, the initial chapters immediately saved me time in processing and parsing large sequencing data sets.  A section at the end of the book on remote access and remote scripting helped me start dealing with text and files on other computers.
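For a taste of what those regular-expression chapters enable, here is a small sketch; the record format and accession numbers below are made up for illustration:

```shell
# Invented records mimicking text pasted from an online database
printf 'GenBank: AB123456.1 | Homo sapiens\nGenBank: XY987654.2 | Mus musculus\n' > records.txt

# One regular expression pulls out just the accession-style identifiers
grep -o '[A-Z]\{2\}[0-9]\{6\}\.[0-9]' records.txt
```

A pattern like this replaces an afternoon of cutting and pasting with a single command, which is exactly the efficiency gain the early chapters deliver.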

The book focuses on Unix-based platforms (Linux, OS X) due to ease of programming, but it does not ignore Windows.  An appendix at the end of the book is useful in translating from one platform to the other.  When the book recommends specific software, which is rare, the focus is on free open-source options.  The programming language Python is the language of choice for much of the book, but another appendix helps sort out the differences among the many programming languages used in biology.  The open-source MySQL database platform is covered for storing and communicating data.  One important goal of the programming and data organization aspects of the book is to standardize reproducibility and improve collaborative work through automation and transparency.

Surprisingly little attention is given to the actual communication of data in graduate coursework and training, so it’s refreshing to see image basics covered here across a few chapters of the book.  These sections focus on basic image creation and manipulation using both commercial and open-source options.

Striking a perfect balance by guiding you through tutorials and nudging your own self-exploration, the book has just enough guided direction to neither annoy nor overwhelm.  This text is not a solution cookbook but, more importantly, a guide to help get you started in data analysis and file format manipulation and to help you think for yourself in addressing your research problems.  While this book will help you deal with text, it doesn’t address software for word processing (Word, OpenOffice), presentations (PowerPoint, Keynote), spreadsheets (Excel), or statistics (R, SAS, SPSS, etc.), as that would create a giant book.  This book does not cover software for phylogenetics or population genetics, and I don’t think it should.

Just to be clear, I’m not being paid here to promote this book.  I just honestly have found this book extremely helpful to my own research and I want to communicate that.  I haven’t read many books which have been able to change my life in a self-actualizing way, but this book helped (…and is still helping) me to do what I was doing before, but more efficiently.

Galaxy Workshop and Community Conference, July 2012

Galaxy is a free web-based platform for bioinformatics and data mining initiated by some of my colleagues at Penn State’s Center for Bioinformatics and Genomics.  There are Galaxy platforms popping up all over the place – JGI, JCVI – and you can run it on your own desktop or computer cluster.  In case you’d like to gain hands-on experience using the platform or want to learn more about setting up your own Galaxy instance, you can attend the 2012 Galaxy Workshop and Community Conference:

The 2012 Galaxy Community Conference (GCC2012) will be held July 25-27 at the UIC Forum at University of Illinois Chicago.

GCC2012 will run for two full days, and be preceded by a full day of training workshops. GCC2012 will have things in common with previous meetings (see GDC 2010, GCC 2011), and will also incorporate new features, such as the training day, based on feedback we received after the 2011 conference.

GCC2012 is hosted by the University of Illinois at Chicago, the University of Illinois at Urbana-Champaign, and the Computation Institute.