Partially Attended

an irregularly updated blog by Ian Mulvany

science scraping with YQL

Mon Sep 6, 2010

Last Saturday at Science Online London I gave a quick tutorial on YQL and how it might be used to mash up scientific data sets. Below I list some of the sample queries that I was playing with. Before you get started with the console, have a look through the documentation. I got a lot of mileage out of the part about filters and joins. The blog post by Paul Hogan on using YQL for library mashups was also very helpful.

In my presentation I was originally looking at extracting data from a report on the spread of soybean rust.

Here are a few sample queries to get you started:

1:

select * from html where 
url="http://sbr.ipmpipe.org/cgi-bin/sbr/county_info.cgi?date=2010-07-13&pest=soybean_rust&host=All%20Legumes/Kudzu"
This query just pulls all of the HTML from the page. Open in console.
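
Outside the console, a query like this is sent as a URL-encoded q parameter. Here is a minimal Python sketch of building that URL. It assumes the public YQL REST endpoint http://query.yahooapis.com/v1/public/yql and a format parameter; the helper name build_yql_url is made up for illustration, and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumption: the public YQL REST endpoint; check the documentation before relying on it.
YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_yql_url(query, fmt="json"):
    """Return a GET URL that carries the YQL statement in the q parameter."""
    return YQL_ENDPOINT + "?" + urlencode({"q": query, "format": fmt})


# The page URL from query 1, split over two lines for readability.
page = ("http://sbr.ipmpipe.org/cgi-bin/sbr/county_info.cgi"
        "?date=2010-07-13&pest=soybean_rust&host=All%20Legumes/Kudzu")
query = 'select * from html where url="' + page + '"'

request_url = build_yql_url(query)
print(request_url)
```

The point of urlencode here is that the page URL contains its own & and % characters, which have to be escaped so they are not read as parameters of the outer request.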


2:

select * from html where 
url="http://sbr.ipmpipe.org/cgi-bin/sbr/county_info.cgi?date=2010-07-13&pest=soybean_rust&host=All%20Legumes/Kudzu" 
and xpath='//table'
This query extracts only the table element from the page. Open in console.


3:

select * from html where 
url="http://sbr.ipmpipe.org/cgi-bin/sbr/county_info.cgi?date=2010-07-13&pest=soybean_rust&host=All%20Legumes/Kudzu" 
and xpath='//table/tr/td[2]'
This query pulls the second cell from each row of the table. Open in console.
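
To see what the td[2] step selects without going through YQL, here is a rough local sketch using Python's standard library. The table below is a hypothetical stand-in, not the real report, and ElementTree only handles well-formed markup and a subset of XPath, so a real page would likely need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the report table; the values are made up for illustration.
SAMPLE = """
<html><body>
<table>
  <tr><td>2010-07-13</td><td>Alabama</td><td>Confirmed</td></tr>
  <tr><td>2010-07-13</td><td>Georgia</td><td>Scouted</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# td[2] means "the second td child of each tr", counting from 1.
second_cells = [td.text for td in root.findall(".//table/tr/td[2]")]
print(second_cells)
```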


4:

select * from html where 
url="http://www.mulvany.net/files/ipmsemanticpipe.html" 
and xpath='//table/tr/td[@id="status" and p="Confirmed"]/..'
For this query I copied the table onto my own server and added some basic proto-semantic markup to the column descriptors. I could then call out specific columns from the table. Open in console.
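
The trailing /.. in that XPath steps back up from the matching status cell to its row, so the whole row is returned. Below is a small local sketch of the same idea in Python. The markup is a guess at what the annotated table might look like (id attributes on the cells, text wrapped in p), and because ElementTree has no "and" inside a predicate, the two conditions are written as two chained predicates instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: cells carry an id naming the column, values are made up.
SAMPLE = """
<table>
  <tr><td id="date"><p>2010-07-13</p></td><td id="place"><p>Alabama</p></td><td id="status"><p>Confirmed</p></td></tr>
  <tr><td id="date"><p>2010-07-13</p></td><td id="place"><p>Georgia</p></td><td id="status"><p>Scouted</p></td></tr>
  <tr><td id="date"><p>2010-07-13</p></td><td id="place"><p>Florida</p></td><td id="status"><p>Confirmed</p></td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# Find status cells whose p child reads "Confirmed", then step up to the parent tr.
confirmed_rows = table.findall("./tr/td[@id='status'][p='Confirmed']/..")

# Pick the place column out of each matching row.
confirmed_places = [row.find("td[@id='place']/p").text for row in confirmed_rows]
print(confirmed_places)
```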


5:

select * from csv where 
url="http://www.mulvany.net/files/ipmpipe.csv" 
and columns='date,place,status'
and status='Confirmed'
For this query I converted the table into a CSV file. It demonstrates YQL’s ability to query CSV files directly. Open in console.
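
For comparison, here is a minimal sketch of the same filter done locally with Python's csv module. It assumes the file has no header row, so the column names are supplied by hand, much like the columns key in the query. The sample rows and the select_csv helper are made up for illustration.

```python
import csv
import io

# Hypothetical rows in date,place,status order, with no header line.
SAMPLE_CSV = """2010-07-13,Alabama,Confirmed
2010-07-13,Georgia,Scouted
2010-07-13,Florida,Confirmed
"""


def select_csv(text, columns, **where):
    """Name the columns, then keep rows whose fields equal every given value."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=columns.split(","))
    return [row for row in reader
            if all(row[key] == value for key, value in where.items())]


confirmed = select_csv(SAMPLE_CSV, "date,place,status", status="Confirmed")
for row in confirmed:
    print(row["date"], row["place"], row["status"])
```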

This work is licensed under a Creative Commons Attribution 4.0 International License