BarCamp Cambridge - James Smith, head of the internet team for Ensembl, talking about Ensembl
Mon Jul 14, 2008
Ensembl came out of the human genome project about 8 years ago to prevent commercialization of genomic data.
the idea was to have an open source human genome
companies would have to do some work before they could make money off of sequences.
the Ensembl project takes the raw data from the genes and adds other data to this, such as reference data from other experiments
there is the Ensembl code and there is the data
there are 41 genomes
the code is also used outside this project
everything is open source
there are probably about 100 installed copies worldwide
it is 1.5 million lines of Perl code
major pharma companies use it and layer their in-house data on top of the public data
there is a public MySQL interface
www.ensembl.org (no e on the end)
there is also an archive system to see old data
everything is in CVS
there are about 40 people involved directly, from the gene builders through to the comparative groups
there is a functional annotation of the genome
there is the web team, an outreach team, a helpdesk team, a warehouse team, and others ..
there is support from the core web team,
35 species in Ensembl: human, mouse, rat, zebrafish
then there are random mammals:
hedgehogs, many mammals from Madagascar
the platypus has a poisoned claw
they are running half a million search index queries on one machine, this makes them about the 5th largest search index in the world
about 2 million page impressions a week
100 GB of data traffic
they have 20 4-core machines, about 80 cores to run the site
BLAST and SSAHA servers
using 40 TB of data at the moment
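To put the traffic figures in perspective, here is a rough back-of-envelope calculation. It is only a sketch: it assumes the 100 GB is weekly traffic, that 1 GB is 10**9 bytes, and that load is spread evenly over the week, none of which was stated in the talk.

```python
# Back-of-envelope figures derived from the numbers quoted in the talk.
# Assumptions (not from the talk): "100 GB" is weekly traffic,
# 1 GB = 10**9 bytes, and load is spread evenly over the week.

SECONDS_PER_WEEK = 7 * 24 * 60 * 60


def site_load(page_impressions_per_week, traffic_bytes_per_week,
              machines, cores_per_machine):
    """Return average load figures for the web farm."""
    return {
        "total_cores": machines * cores_per_machine,
        "pages_per_second": page_impressions_per_week / SECONDS_PER_WEEK,
        "bytes_per_page": traffic_bytes_per_week / page_impressions_per_week,
    }


load = site_load(
    page_impressions_per_week=2_000_000,
    traffic_bytes_per_week=100 * 10**9,
    machines=20,
    cores_per_machine=4,
)

for name, value in load.items():
    print(f"{name}: {value:,.2f}")
```

Under those assumptions the average works out to a little over 3 page impressions per second and roughly 50 KB per page; real peak load would of course be higher than the weekly average.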
you expect hardware failure every week, and they don't let you know
at this point it is about one hardware failure every day
currently on 3rd set of web code
2000 human
2001 mouse
2001 fly
2003 Vega site
2004 archive site started
2005 web code v3
2006 users and groups
in about a month Ensembl 50 will be released
also have a number of other sites
they have a two month cycle for releasing data and code.
the day after each release they start building genes again
many data sets take longer than this, for example, the new mouse sequence was released by NCBI 6 months ago, but it has taken this long for Sanger to do the annotation and comparative work.
there is a pre-site for data that didn't quite finish within the two month cycle
VectorBase - Ensembl for disease vectors
Gramene - Ensembl for plants
Cosmic - uses the drawing code
they are moving over to AJAX because people don't realize that items in the interface are buttons or forms.
a lot of the interaction is human interaction
they hope they can make AJAX that does not break screen readers, and hope that AJAX will offer a web services platform. this leads to issues of display vs data markup.
web code is extensible by plug-ins.
can add code which resides outside the main Ensembl CVS tree but is accessible from within.
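As an illustration of the plug-in idea, here is a minimal sketch of a registry where code registered from outside the core tree is reachable through the same lookup as core code. Everything here is hypothetical: the class, the component names, and the rule that later plug-ins take precedence are assumptions for the example, not the actual Ensembl web code (which is Perl).

```python
# Hypothetical plug-in registry. Names, structure and the precedence rule
# are illustrative assumptions, not the real Ensembl implementation.


class PluginRegistry:
    def __init__(self):
        # Ordered list of (plugin_name, components); later entries win.
        self._layers = []

    def register(self, plugin_name, components):
        """Add a plug-in's components; they may live outside the core tree."""
        self._layers.append((plugin_name, dict(components)))

    def resolve(self, component_name):
        """Return (plugin_name, component) from the most recent plug-in that has it."""
        for plugin_name, components in reversed(self._layers):
            if component_name in components:
                return plugin_name, components[component_name]
        raise KeyError(component_name)


registry = PluginRegistry()
registry.register("core", {
    "gene_page": lambda: "core gene page",
    "help": lambda: "core help",
})
# A site-specific plug-in kept outside the core tree (hypothetical name).
registry.register("my_lab", {
    "gene_page": lambda: "gene page with in-house tracks",
})

owner, render = registry.resolve("gene_page")
print(owner, "->", render())
```

The point of this design is that a local installation can add or replace pieces without editing the shared code, which would fit the earlier note about companies layering their own data on top of the public data.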
and that's it
Q: how does MySQL cope?
it copes really well, they have 150 GB, about 5 GB is in a RW DB, the rest is in read-only DBs
the issue is not the size of the data, but the number of tables.
one of the DBs has 3000 tables, so they have very careful balancing of data on the servers
some problems come from MySQL not being able to have key tables larger than 4 GB, and when you have 60 GB of memory then you run into this problem.
the bottlenecks tend to be in the code layer, not in the DB
this is one of the largest MySQL DBs in the world
currently using 4.something, keep planning to move to 5, but keep finding other things that are more important.
there are a lot of left joins in some queries.
sometimes it is easier to do these joins in Perl rather than in MySQL, millions of times faster than in MySQL
connected to the net via a 1 Gb link to JANET.
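To show what doing a left join in the code layer can look like, here is a minimal sketch of a hash-based left join over rows already fetched from the database. It is written in Python for illustration (the talk's code is Perl), and the table and column names are made up, not the Ensembl schema. It assumes the two row sets share only the join column.

```python
from collections import defaultdict


def left_join(left_rows, right_rows, key):
    """Left-join two lists of dicts on `key`, entirely in application memory.

    Assumes the only column name the two sides share is `key`.
    """
    # Build a hash index over the right-hand rows: one pass, O(len(right_rows)).
    index = defaultdict(list)
    right_columns = set()
    for row in right_rows:
        index[row[key]].append(row)
        right_columns.update(row)
    right_columns.discard(key)

    joined = []
    for left in left_rows:
        matches = index.get(left[key])
        if matches:
            for right in matches:
                joined.append({**left, **right})
        else:
            # No match: keep the left row and fill right columns with None,
            # mirroring the NULLs a SQL LEFT JOIN would produce.
            joined.append({**left, **{col: None for col in right_columns}})
    return joined


# Hypothetical data for illustration only.
genes = [{"gene_id": 1, "name": "geneA"}, {"gene_id": 2, "name": "geneB"}]
xrefs = [{"gene_id": 1, "label": "x1"}, {"gene_id": 1, "label": "x2"}]

rows = left_join(genes, xrefs, "gene_id")
for row in rows:
    print(row)
```

Whether this beats the database depends on the query and the data. The sketch only shows the technique: fetch each table with a simple query, then match rows through an in-memory hash.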