BarCamp Cambridge - James Smith, head of the internet team for Ensembl, talking about Ensembl

Mon Jul 14, 2008

Tags: cambridge, barcamp, mac, perl, web, rss, talk

Ensembl came out of the Human Genome Project about 8 years ago to prevent commercialization of genomic data.

the idea was to have an open source human genome

companies would have to do some work before they could make money off of sequences.

the Ensembl project takes the raw data from the genes and adds other data to this, such as reference data from other experiments

there is the Ensembl code and there is the data

there are 41 genomes

the code is also used outside this project

everything is open source

there are probably about 100 installed copies worldwide

it is 1.5 million lines of Perl code

major pharma companies use it and layer their in-house data on top of the public data

there is a public MySQL interface

www.ensembl.org (no e on the end)

there is also an archive system to see old data

everything is in CVS

there are about 40 people involved directly, from the gene builders through to the comparative groups

there is a functional annotation of the genome

there is the web team, an outreach team, a helpdesk team, a warehouse team, and others

there is support from the core web team

scale

35 species in Ensembl: human, mouse, rat, zebrafish

then there are random mammals: hedgehogs, many mammals from Madagascar

the platypus has a poisoned claw

they are running half a million search index queries on one machine, this makes them about the 5th largest search index in the world

about 2 million page impressions a week

100 GB of data traffic

they have 20 4-core machines, about 80 cores, to run the site
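A quick back-of-the-envelope check of the figures above, as a sketch only. It assumes the 100 GB of traffic is per week (the notes do not say) and uses decimal gigabytes; the variable names are made up for illustration.

```python
# Rough averages derived from the numbers quoted in the talk.
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

page_impressions_per_week = 2_000_000
traffic_bytes_per_week = 100 * 10**9  # assumption: 100 GB is a weekly figure
machines = 20
cores_per_machine = 4

pages_per_second = page_impressions_per_week / SECONDS_PER_WEEK
bytes_per_page = traffic_bytes_per_week / page_impressions_per_week
total_cores = machines * cores_per_machine

print(f"average pages per second: {pages_per_second:.1f}")
print(f"average KB per page: {bytes_per_page / 1000:.0f}")
print(f"total cores: {total_cores}")
```

Under those assumptions this works out to roughly 3.3 pages per second on average and about 50 KB per page, spread over 80 cores. These are averages, so peak load would be higher.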

BLAST and SSAHA servers

using 40 TB of data at the moment

you expect hardware failure every week, and they don't let you know

at this point it is about one hardware failure every day

currently on 3rd set of web code

2000 human; 2001 mouse; 2001 fly; 2003 Vega site; 2004 archive site started; 2005 web code v3; 2006 users and groups

in about a month Ensembl 50 will be released

also have a number of other sites

they have a two-month cycle for releasing data and code.

the day after each release they start building genes again

many data sets take longer than this; for example, the new mouse sequence was released by NCBI 6 months ago, but it has taken this long for Sanger to do the annotation and comparative work.

there is a pre-site for data that didn't quite finish within the two-month cycle

VectorBase - Ensembl for disease vectors

Gramene - Ensembl for plants

Cosmic - uses the drawing code

they are moving over to AJAX because people don't realize that items in the interface are buttons or forms.

a lot of the interaction is human interaction

they hope they can make AJAX that does not break screen readers, and hope that AJAX will offer a web services platform. this leads to issues of display vs data markup.

the web code is extensible by plug-ins.

you can add code which resides outside the main Ensembl CVS tree but is accessible from within.

and that's it

Questions:

Q: how does MySQL cope?

it copes really well. they have 150 GB; about 5 GB is in a read-write DB, the rest is in read-only DBs

the issue is not the size of the data, but the number of tables.

one of the DBs has 3000 tables, so they have very careful balancing of data on the servers

some problems come from MySQL not being able to have key tables larger than 4 GB, and when you have 60 GB of memory then you run into this problem.

the bottlenecks tend to be in the code layer, not in the DB

this is one of the largest MySQL DBs in the world

currently using 4.something, keep planning to move to 5, but keep finding other things that are more important.

there are a lot of left joins in some queries.

sometimes it is easier to do these joins in Perl rather than in MySQL; it can be millions of times faster than in MySQL

connected to the net via a 1 Gb link to JANET.
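To illustrate the idea of doing a left join in application code rather than in the database, here is a minimal sketch. The talk was about Perl; this uses Python for brevity. The table contents, column names and the `left_join` helper are all hypothetical and are not taken from the Ensembl code.

```python
from collections import defaultdict


def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key` using an in-memory hash index."""
    # Build the index once: one pass over the right-hand rows.
    index = defaultdict(list)
    for row in right_rows:
        index[row[key]].append(row)

    joined = []
    for left in left_rows:
        matches = index.get(left[key])
        if matches:
            for right in matches:
                joined.append({**right, **left})
        else:
            # No match: keep the left row. Unlike SQL, the right-hand
            # columns are simply absent rather than NULL.
            joined.append(dict(left))
    return joined


# Hypothetical rows, as if fetched with two simple single-table queries.
genes = [
    {"gene_id": 1, "name": "gene_a"},
    {"gene_id": 2, "name": "gene_b"},
    {"gene_id": 3, "name": "gene_c"},
]
xrefs = [
    {"gene_id": 1, "xref": "ext:A"},
    {"gene_id": 1, "xref": "ext:B"},
    {"gene_id": 2, "xref": "ext:C"},
]

result = left_join(genes, xrefs, "gene_id")
for row in result:
    print(row)
```

Each side is read once and every lookup is a dictionary access, so the cost is roughly linear in the number of rows. Whether this beats the database depends on the query and the indexes available, and it only works when both row sets fit in memory.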

This work is licensed under a Creative Commons Attribution 4.0 International License