FuturePub Jan 2015 - Lens Next

On 2015-01-27 I gave one of the short talks at the FuturePub event. My slide deck is here. I wanted to give a quick update on where the Lens viewer for research articles is heading. Lens is a great platform for experimentation, and we iterated on some ideas towards the end of 2014 that have now made it into the 2.0 release.

The main update is that Lens can now be configured to accept information from a third-party source and display that information in the right-hand resources pane. Lens converts an XML version of a research article into JSON, and then renders the JSON in its distinctive two-column layout. Up until now Lens was only able to get information from the article XML source file. We have added the ability for Lens to check an API and, if a result is returned, Lens can be configured to show that information. As part of this you can define a new right-hand column for the specific source of information that you are interested in showing next to the article.
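
Lens itself is JavaScript, but the check-then-show logic is simple enough to sketch. Here is a minimal illustration in Python of the flow described above (the source names, URLs and payload shape are hypothetical, not Lens's actual configuration keys; the transport is injected so it can be stubbed):

```python
def panels_for(article_doi, sources, fetch):
    """Build the extra right-hand panels for an article: only sources
    whose API actually returns a result for this article get a panel."""
    panels = {}
    for name, api_url in sources.items():
        data = fetch(api_url, article_doi)
        if data:  # no result -> no extra panel; Lens keeps its defaults
            panels[name] = data
    return panels

# a stub transport standing in for a real HTTP call
def fake_fetch(api_url, doi):
    return {"score": 42} if "altmetric" in api_url else None

panels = panels_for(
    "10.7554/eLife.00001",
    {"altmetrics": "https://api.altmetric.com/v1/doi",
     "related": "https://example.org/related"},
    fake_fetch,
)
# only the altmetrics source returned data, so only it gets a panel
```

The point of injecting the fetcher is that the panel-building decision stays independent of any particular third-party API.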

Here is a screenshot of an article with a related-articles feed, and you can check out the example.

related articles pane example

Here you can see an example version of Lens that taps into the altmetric.com API, and you can play with the example.

altmetrics pane example

You can get started with this new version of Lens right now, and there is a starter repository with an example customisation ready to play with.

In addition to this major change, there has been a big improvement to the way Lens handles mathematics (thanks to contributions from the American Mathematical Society), and there have been a number of other smaller improvements too.

Other speakers in the evening

Christopher Rabotin from Sparrho

- they currently have 1.7M pieces of content in the platform
- they are launching a publisher API, so that people can push content into their platform, and see usage of that data

I’m looking forward to seeing this; I’m very interested in what usage eLife content gets on this platform.

Kaveh - River Valley Technologies

- Kaveh demonstrated his XML end-to-end workflow

This tool has come along nicely since the last time I looked at it. This is definitely the future of the interaction between authors and production, but the big gap remains getting the original manuscript into this form. There are some moves in that direction, with people moving to tools like Authorea and WriteLaTeX, and a number of other typesetters are offering this kind of online proofing environment. It’s an area of fairly rapid iteration at the moment, and I wish Kaveh good luck with this toolchain.

Scientific literacy tool - Keren Limor-Waisberg

This is a chrome extension that helps readers understand the context of the article that they are reading. You can check it out here. I’m going to take it for a spin over the next week.

Overleaf offline - Winston Li

This was technically the most impressive demonstration of the evening. A group of students have worked with Overleaf to connect it to git. You can now git clone your paper and work on it offline, and send a git push to update your online version. There are a few limitations, but this is a huge step for the product, and these students did it in about 3 months. What can you do with it? As Winston reminded us:

As Turing tells us, you can do anything that is computable: it’s the command line! It’s git!

Some thoughts on FORCE2015: science, trust and ethics.

Last week I was at the FORCE2015 conference. I enjoyed it greatly. This was the 2015 instance of the FORCE11/Beyond the PDF conference. I’d been aware of these meetings since the first one was announced back in 2011, but this was my first chance to attend one. (If I recall, I’d even been invited to the Dagstuhl workshop, but had been unable to attend. I’d been to one Dagstuhl workshop on science and social networks many years ago, and that had been one of the best short meetings that I’d ever attended, so I’d been sad not to be able to go to the Beyond the PDF meeting).

This meeting was one of those where I felt totally surrounded by my tribe. I’ve been despairing at the inability of publishers to do much better than produce PDFs for research communication; I’m constantly grading our efforts as “must do better”. This was a meeting mostly filled with people who are interested in making things better (by which I mean the creation of tools and systems that help researchers, rather than ones that pin researchers into metrics traps).

I read most of the 70 or so posters during the poster session. When I got home my wife said “nobody reads all the posters”. Some of the posters were great, and some were awful, but I read through most of them. I’ve put up a gallery of the posters on my Flickr account.

So, here are some relatively unconnected thoughts on the conference.

Riffyn looks like it might actually be useful. I’d been hearing about this product for about half a year now, and there is a small article in Nature that very much does not describe it, but up until this meeting it looked very much like vapourware, with no concrete explanation of how it might achieve what the company says it will. After chatting with the founder and seeing his short presentation, I see two components to the value offering. The first is a way to stream readout data from lab equipment to a central data collection service. The second is a software suite that allows you to modularise and encapsulate the components and variables of the experiment under consideration. At eLife we have been looking a lot at infrastructure for log analysis, and at tools that can operate on streaming data, and I was reminded that the big data industry (sysops, devops) has been churning out tools to do a lot of this kind of thing for a few years now, so again it might be an example of where the research bench could learn from the software industry. I would imagine that a product like this will be of more assistance at the point where one wants to move from exploratory research to translation, and that initially the kinds of labs that will be set up to take advantage of a system like this are likely to be the more organised ones, who probably don’t need something like this. It also reminds me of the experiments that Jean-Claude Bradley was doing with his open notebook science back in the mid-2000s.

I greatly enjoyed the session on reproducibility. The speakers were mostly in agreement. It was mentioned in the session that some of the nuances over the difference between reliability and reproducibility were just semantic; of course, the thing about semantics is that it is a reflection of how we use language, and I found it valuable to be presented with different vantage points from which to look at the topic. In a nutshell, when we say we want reproducible science, what we really want is for the claims that are made about the world to be reliable ones. We want an operational definition of truth (on which there is an outstanding episode of In Our Time). Bernard Silverman described it in the following way: a claim or a paper is a virtual witness to the act of a person finding something out about the world. Witnessing requires trust, and trust requires the existence of a network on which to build that trust, and that network needs an ethical framework within which to operate. Science has an ethical framework (it seems to me that at its heart are the ideas that we don’t lie about reality, and that we grant recognition and dignity to others and to the natural world). In this context the reliability of results is a necessary, but not sufficient, condition for ethical work.

As an aside, how this ethical framework emerged in the sciences is fascinating; for every behaviour that we idealise it is easy to look back at distant and recent history and see successful practitioners and results that contravene our ideas of ethical science. The first surgeon to perform a successful heart transplant, Christiaan Barnard, experimented in a totally unethical way. Mendel’s results were not trustworthy, and Newton did not openly communicate his work.

As a further aside, it also seems to me that online networks created by commercial entities operate within their own micro-ethical frameworks, and as long as we passively participate in them with no say in their governance or mores of operation, the likelihood is that they will have a strong tendency to overstep our ideas of ethical behaviour.

There was some dissent amongst the speakers (well, from one speaker) on the need for reproducibility. He did say that he didn’t think he was going to have a very receptive audience for his views, and that’s certainly the case for this member of the audience. I think he was complaining about the idea of the need for reproducibility, basing this on the claim that we have never had reproducibility in science, whereas if we understand that what we want is reliability, and if we recognise that we are concerned that there may be areas of research that are highly unreliable, then his objection falls apart. I think it is important to ground our calls for reliability in science in the instances where we fear there may be issues. That covers a number of behaviours: making results available on request, providing all of the steps in the paper that are required to replicate an experiment, not making shit up. Things like the Reproducibility Project: Cancer Biology can help to provide a survey overview of a field and give some evidence on the nature of the challenges around reliability we may face in a specific discipline. One of the people working on this project mentioned to me that at least one of the papers that they wanted to do a reproduction study on was impossible to tackle, because the lab involved refused to share critical material with them. Many of the other papers that they are looking at are impossible to replicate just from the descriptions given in the published papers (most of these labs are helping, but perhaps it exposes the limits of the research paper as a format for communicating extremely technical and complex information).

I believe I understood this speaker to say that asking reviewers to understand the details of a paper to the extent of being able to determine whether it could be reproduced would add too much burden on the reviewers, leading to reviews of much lower quality. I guess that depends on the nature of the review being asked for - whether it’s checking for rigour, novelty or eLife-ness - but I would hope that one would not drop the requirement for rigour just in the search for novelty, and I would hope that reviewers keep an eye open to ask whether the claims made in the paper can be supported by the evidence and the tools used to gather that evidence.

On the topic of statistics, this leads to an interesting question. At this point it’s well documented that many papers use incorrect statistics, or statistics of low power. It might well be that a reviewer does not have the time to run these numbers; it would be great if we could automate these kinds of checks. A precursor for that is making the underlying data available.
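
As an illustration of the kind of automated check I have in mind, here is a small sketch (my own invention, not an existing tool) that recomputes an approximate two-sided p-value from the summary statistics a paper reports and flags a mismatch. It uses a normal approximation, which is only reasonable for moderately large samples:

```python
import math

def two_sample_p(mean1, sd1, n1, mean2, sd2, n2):
    """Approximate two-sided p-value for a difference of two means,
    using a normal approximation to the two-sample test."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = abs(mean1 - mean2) / se
    # two-sided p from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def check_reported_p(reported_p, *groups, tol=0.01):
    """Flag a reported p-value that disagrees with the reported
    summary statistics by more than a tolerance."""
    recomputed = two_sample_p(*groups)
    return abs(recomputed - reported_p) <= tol, recomputed

# a paper reports p = 0.03 for means 5.1 vs 4.4 (sd 1.2, n = 40 per group);
# the recomputed p is ~0.009, so the reported value gets flagged
ok, p = check_reported_p(0.03, 5.1, 1.2, 40, 4.4, 1.2, 40)
```

Real checkers would need proper t-distributions and far more robustness, but even a crude pass like this only becomes possible once the summary data is machine-readable.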

When it comes to making the underlying data available, someone from the audience raised the question of what we are to do with results that come from places with un-recreatable data sets (like Google and Facebook). I think the inference was that such places hide their data and don’t make it available, and yet do a lot of research, so can we trust the results that come from these places? The panel had a good comment on this. They extended the example to that of CERN, a facility that would be impossible to ever replicate. Many of the experiments that happen at CERN can never be replicated, but the people at CERN operate as if they could. By putting in place working methods that make their work in principle possible to recreate, they produce better science. As one of the panelists said, if an experiment is truly reproducible, then you would never need to reproduce it (which comes back down to the issue of trust and reliability). Indeed it would be unethical to reproduce certain classes of research, such as widespread epidemiological studies or clinical trials. Once you have a cure, it would be unethical to withhold it for the sake of replicating the experiment. I think that the big data companies - such as Google and Facebook - have at least two further motivations to keep them honest, to a certain degree. The first of these is the profit motive. They are not, in principle, doing experiments with their data for the sake of it; they are attempting to gain market share and to devise products that people will want to use. The success of their experiments is judged by whether what they produce is used or not, not by whether they get a publication. Another powerful force at work in these organisations is the need to share code and resources: effectively most of their engineers are commoditised, and are replaceable.
A new engineer coming into a team has to be able to get spun up quickly, and needs to be able to run the analysis or deploy the code towards the shared good. In that light they use significant amounts of code review, they code to common standards, and they encapsulate their work in a way that makes it easy to scale horizontally. They keep the configuration of their systems in code, in a self-documenting way, and, at least within the organisation, have broad openness about how they are doing what they are doing, which in a pleasing turn of thought brings us back to Riffyn, which is hoping to provide tools to allow that type of commoditisation to enter the lab.

Onto another track now: I’d like to recount a small conversation that I had with one of the people presenting a poster. The idea looked interesting, but challenging. I started to ask the person about how their idea would be useful to people, and I was told all about the potential benefits. I’m always interested in how an idea is going to get market share, because that’s often the hardest thing, especially in a marketplace that values conservative behaviour. At this point it became clear that nothing had been built yet, so I asked whether this person had done any user testing on the basic idea, or had any plans to. I was told that there were no plans to do any user testing, and that in fact getting to a working prototype was probably out of scope, but that modelling the semantic relationships of the idea, and doing some computer science work on that side, were probably more than enough work for the PhD the person was working on. Hmmm. I think this is a weakness of the more academic side of this field; I want to see things emerge that are useful for the research enterprise, and I want to see us build on top of the great corpus of open content that is now emerging.

Phil Bourne mentioned this in his closing remarks at the conference: there is still a way to go before open access can really prove its worth. Indeed some of the more interesting technology demonstrations I saw were built on top of closed source software, analysing closed corpora of content provided by the big publishers. My friends, we need to build; together we need to build the future that we want to see emerge on top of the great potential of open content. You don’t need permission, just go and start now.

Synthesis of breakout sessions from day 1 - institutions and metrics

Thanks to Kevin Dolby, Martijn Roelandse, Mike Taylor and Andrea Michalek for taking the notes from each of the breakout sessions; I have synthesised them here.

Altmetrics could be used as a way to indicate the pathway of impact

Institutions should define their game plan: what do we want to achieve, and what metrics can help get us there? They could give guidance to researchers on which platforms to adopt (the landscape is cluttered, but at the same time institutions probably don’t know either). That said, funders are behind the principle that universities should drive which metrics they want to collect, and the set of standards, instead of funders prescribing metrics (don’t get led by what gets measured; define the change you want to effect first).

Researchers still don’t know about these metrics. There are some routes to education, in particular via the library (a course exists in Sheffield), but there is a general sense of “it’s not worth the time”. A key is going to be getting the younger researchers to adopt. We need to find incentives (I’m not going to mention what form those incentives could take).

On the topic of adoption, it’s clearly discipline dependent (there is a first mover problem in a field, if others are doing it, it can be seen as more acceptable)

One university required faculty to spend “two points” on community outreach; managing the department Twitter account allowed them to tick this requirement.

In contrast to what one group reported, another reported that younger researchers were more engaged, while those at the top had no interest.

Making the metrics personal to the researcher was considered appealing.

It’s clear we can’t mandate participation.

Can there be a standard? No, for the reasons above (mainly around participation). So if there isn’t one, then anyone who provides altmetric data will inevitably curate it favourably. When does this become cheating? How do we, as funders, learn how to read altmetric data? What other evidence might be provided?

Gaming will happen; perhaps we could embrace that. Altmetrics can incentivise researchers to make their research available open access. In addition, don’t forget the Humanities!

Altmetrics could be used to raise the institutional profile, and even help with tracking public engagement, however …

Metrics (both new metrics and traditional citations) count “hits”, so negative and positive hits are counted equally. This leaves the “quality” problem unsolved.

Looking at which metrics are valuable at a discipline level vs an institution level could also be interesting.

For late adopters of social media, showing them social media metrics about their own work could be helpful in demonstrating the value of social media to them.

Do please use standards, DOIs, ORCIDs etc.
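
Checking that identifiers are at least well formed is cheap to automate. Here is a sketch in Python (the DOI pattern is a loose sanity check of my own, not the full spec; the ORCID check digit follows the ISO 7064 mod 11-2 scheme that ORCID uses):

```python
import re

DOI_RE = re.compile(r"^10\.\d{4,9}/\S+$")          # loose sanity check only
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")

def looks_like_doi(s):
    """Cheap structural check: '10.' prefix, registrant code, suffix."""
    return bool(DOI_RE.match(s))

def valid_orcid(s):
    """Format check plus the ISO 7064 mod 11-2 check digit."""
    if not ORCID_RE.match(s):
        return False
    digits = s.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    return digits[-1] == ("X" if check == 10 else str(check))

# e.g. valid_orcid("0000-0002-1825-0097") is True (a well-known sample ORCID)
```

Checks like these catch transcription errors at submission time, long before anyone tries to resolve a broken identifier.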

Where artefacts live in different places (when content is promiscuous), find a way to have usage data flow between the different object silos.

How do we stop altmetrics being misused, a la the JIF? Since there is no safety catch on altmetrics, people are free to misuse them! Openness, communication, and conferences like this one will help.

FuturePub3 - September 2014 event

Welcome back for the next installment of FuturePub. There are tons of people at the meeting tonight, pizza and beers care of WriteLaTeX!!

# Sumika Sakanishi - Product Manager - ODI

They aim to encourage organisations to unlock data. They also work with individuals to help them unlock the value of open data.

Open data is free to use, reuse and redistribute, e.g. under a CC-BY licence.

The Open Data Institute have created the Open Data Certificate, an online tool to help open data owners publish their data. It’s a questionnaire. At the end you get a certificate telling you how open your data is.

You get a badge that you can put on your website. I remain bearish on the topic of badges on websites.

They have issued about 1k certificates in either draft or published form. About 100 are fully published.

The tool is available here: certificates.theodi.org.

For more thoughts on opening data within research see my recent posts here, here and a summary here.

Walacea - back science you believe in


This is a crowd funding site for researchers.

The name is inspired by Alfred Russel Wallace, who crowd-funded his work, in 19th-century terms.

They launched today with their first two projects.

Walacea receives a 5% commission on all projects.

It’s interesting that they put their commission up front.

They are hoping to be able to raise on the order of 50k per project. The initial two projects are aiming for about 15k. On the question of the addressable market: in the UK, £9B is given by the public to charity every year, so that’s one indication of how to measure the possible addressable market. They contrast with experiment.com by providing a rewards programme for donors.

(Lots of interest in this talk, lots of questions.)

Anna Sharman - Cofactor, and journal selector tools.

Other tools that exist - Journal/Author Name Estimator
Edanz Journal Selector - []

Cofactor’s is a complementary tool that might be useful. It focuses on broad-scope journals; there are many options and it’s manually curated.

It includes queries around options such as:

  • type of peer review
  • licence
  • gold v hybrid
  • APC
  • non profit vs profit
  • length limits
  • copyediting or not

The current focus is on broad scope open access journals.

Looks nice!

How are they going to tackle issues around scalability? Currently they are checking the data against the journal website. The issue of scalability needs to be considered carefully.

The aim of the tool is to bring people to the website.

BookGenie 451 - Andrew McFarland

Their mission is to improve research outcomes in higher education. They match a profile of the researcher against content coming from a publisher’s repository. They have very clever tools that do this.

They produce snips of content based on keyword search.

Their co-founder was a CTO at Amazon and was one of the people who worked on the Kindle.

They aim to sell micro-pieces of content for micro-prices, taking a 40% cut on sales of content to consumers. They have 4 publishers set up for a proof of concept.

There is a “textbook crisis”: too many unaffordable textbooks in the higher education market.

The impact of Open Access is going to adversely affect publisher incomes (yay!! – my comment).

Search engines for academic content are poor; institution-based, repository-specific search indexes perform better than Microsoft Academic Search or Google Scholar. The big question is how you move the user behaviour. If you can get the user to search in an app, or in their book, that might be an option.

They are expecting BookGenie 451 to become the iTunes for academic search.

(There are soooooo many questions about this product, but I don’t have to ask any at this stage, because they are at such an early stage that the reality of creating a real product will iron out many of them, so I’ll come back and ask my questions in 18 months, if they are still around.)

# Alan Hyndman - the latest from planet figshare

  • they have a figshare for publishers suite of products
  • the figshare portal is a library of supplementary data for a publisher
  • the figshare datastore can handle files of up to 200GB
  • figshare innovations - any cool data-related products - the related content engine on PLOS is driven by this tool, and it will do relatedness at the individual file level (sounds a little like the Source Data project that EMBO is working on)
  • HGV database - human genome variation database

# Matias Piipari - new developments from Papers, the citation tool of the future?

  • their citation tool is a standalone tool; it can insert citations into almost any app on your Mac
  • it’s inspired by applications like Alfred and other quick-launcher applications
  • they want their citation tool to be like a quick launcher for scientific content

# Wrapup

It’s a wrap, time for the pub!! The next FuturePub event will be in January in Oxford; I’m going to try to make it.

thoughts on the ERC data workshop

On Thursday and Friday of last week I attended a European Research Council workshop on managing research data. It was well attended, with about 130 participants bringing views from across the academic disciplines. I’ve blogged my raw notes from day one and day two. In this post I reflect on the points that I noticed were raised over the two days. People have been talking about the increasing importance of research information for many years now, and a hope was raised in the opening comments that we might be able to provide solutions to the problems posed by research data by the end of the workshop. I was skeptical about our chances of doing that. The risk at a meeting like this is that the same points and problems get regurgitated, problems are listed at too high a level, and everyone calls on everyone else, or at least someone else, to step in and solve the problem. There were aspects of all of these issues, but there were also highly encouraging signs, and signs of real progress in solving some of the perennial existential questions of research data. Over the course of the two days I made a note, when I noticed it, of when specific named issues, potential solutions, or novel points were made.

By the end of the first day the problems list far outweighed the solutions list, but by the end of the second day that ratio had reversed. I’m going to briefly drill into each one in a moment, but before doing that I’ll touch on the highlights coming out of the meeting.

By the end of the meeting the chair put it well when he said that overall the feeling coming out of the meeting was one of unity, a shared desire and understanding that data should be open, and a shared understanding that some culture change is necessary. We have many parties interested in this issue, and we all want to move faster on the issue.

There were signs of real progress too. LERU have a working paper on research data, and the take home message is that university chancellors almost universally think that research data should be made open, and that this will be a high priority issue for them - once they figure out what they are doing about open access.

How to cite data is now solved, in principle. The FORCE11 data citation principles solve this; what remains is implementation (already in progress in the life sciences), and then adoption. Adoption is going to be where the largest challenges lie, because if we have a mechanism for citing data, and researchers continue to turn up to meetings like this asking “how do I cite data?”, then obviously there is work to do. We have to continue this work until researchers turn up to meetings like this one and say “this is how I cite data”. We want data citations everywhere.
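
The implementation work is largely about making citations like this consistent and machine-actionable. As a trivial illustration, one commonly recommended form (Creator (Year): Title. Publisher. Identifier) can be assembled mechanically; the record below is invented for the example:

```python
def cite_dataset(creators, year, title, publisher, doi):
    """Render a data citation in the form:
    Creator (Year): Title. Publisher. Identifier."""
    names = "; ".join(creators)
    return f"{names} ({year}): {title}. {publisher}. https://doi.org/{doi}"

citation = cite_dataset(["Smith, J", "Jones, A"], 2015,
                        "Example imaging dataset", "Dryad",
                        "10.5061/dryad.example")
```

The hard part is not the string formatting, of course, but getting repositories, publishers and reference managers to exchange the underlying metadata reliably.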

A working solution to how researchers can make claims on what data they have produced was demonstrated by Sünje Dallmeier‐Tiessen with the ODIN project. Again there is work needed here to promote adoption, and work to do on usability and interoperability.

It wasn’t all light and harmonious music though, there were a few telling shadows, a few indicators that the problem remains a deep and challenging one. It was notable that no LERU university has any reward system or prize system in place for good use or reuse of research data, or any mechanism in place for rewarding excellence in the support systems for research data. There is a Dutch prize on this topic, but it’s clear that more can be done.

In fact, a need for culture change was mentioned often. It should be obvious where this change can best be effected - in the grant awarding process and in the hiring process. The EU, indeed all funders, are wary of sticks, but let’s sow the fields of Europe’s rich plots of data with an abundance of carrots. Let’s make available specific funding to support bottom-up approaches to training for data management. There is already an appetite, with initiatives like Software Carpentry, the creation of figshare, and the growth of Dryad. Goodness, we could even invest in library infrastructure for this purpose. Let’s set up a research track for pure data reuse, with grants awarded to those who have projects that reuse the data of others, giving them the time and resources to clean up that data. Let’s make clear that data are a real research output that counts in assessment. There would be no requirement to do this, but researchers who did would have their work recognised where it matters most. There were a few calls that anywhere between 5% and 15% of all research funding should go to data management, but I think it would be better to look at how we can alter behaviours on the ground from the bottom up. Data is important. After all, the data is the science, or at the very least it is the embodiment of our articulation of how we have grappled with reality, and it is the trail that shows our direct engagement with nature.

Having real options for data management careers in research could also help in the short term, and in the medium term could help create a workforce that is skilled in the management of big data.

OK, so let’s now look at each of the points that I captured from the two days at the meeting. I’ll list these as either questions, problems or solutions. I’m grouping them into topics that seem to make sense to me, so my groupings don’t reflect the order in which these topics arose, but I hope by doing this I can provide a horizontal view across the breakout sessions from the meeting to get the common themes that emerged. I’ll list the solutions as they were proposed. They stand here for your consideration. I add my own commentary at the bottom of each section.

Moolah, money, cash

Problems or questions raised

- Question: who pays, what do they pay for?

Solutions proposed

- Solution: provide funding for data sharing.
- Solution: take a percentage, say 15%, and set it aside in every grant for data sharing, curation and storage.
- Solution: do bulk purchasing from providers, and distribute compute and storage credits to researchers.

My comments

The inference seemed to be that it should mostly be funders supplying the cash through some mechanism. The idea of doing bulk purchasing for infrastructure, and then giving researchers credits, is an appealing one. Such approaches will be good for big data, but will have little impact on the majority of instances of data that are created - things like individual Excel files on an individual’s computer.

Infrastructure and support

Problems or questions raised

- Problem: hiring domain experts, e.g. DBAs, on a temporary basis is hard.
- Problem: formats need to be updated, data needs to work in the long term and needs intelligent curation, and the structures to support this do not exist.
- Problem: manual labour is required, and the current credit system does not support that.
- Question: how might we provide more training?
- Problem: data is useless without the computational infrastructure behind it.
- Question: how might we provide better infrastructure for data?

Solutions proposed

- Solution: have a bank of data domain experts in the library/institution who can be seconded out or hired on short-term contracts by researchers.
- Solution: create a profession of data curators.
- Solution: the EU should take care of infrastructure EU-wide to promote a level playing field.

My comments

All of these solutions are good in principle, however they will require real will to create these kinds of incentives. It was often mentioned throughout the meeting that these kinds of skills could be provided through the private sector, and there was real concern that such an approach might lead to restrictions on the data if that data becomes controlled by a private company. Academic publishers were mentioned. I find it hard to see an EU-wide rolling out of an army of data curators, I think that has to come bottom up, from within disciplines. I could see libraries making a case to equip themselves for this task, but I don’t see them as being the natural inheritors of that task. It seems that institution-wide facilities, or national facilities might be good places for these kinds of roles to reside.

Fundamental definitions

Problems or questions raised

- Problem: open data has been defined by rich labs; it's ambiguous, and currently non-inclusive.
- Question: how can we get to an agreed understanding of what open data is, and what currency it has in research communication?
- Question: how might we define data, per discipline?

My comments

No solutions were proposed for this topic, but a great point was made that what we are calling data sharing is really data dissemination, as those making their data available are not usually waiting for some reciprocal piece of data (although, to be fair, reciprocity was the example given as the motivation behind sharing genomic data).

Management and interoperability

Problems or questions raised

- Question: how do we combine heterogeneous data from within one discipline or study?
- Question: how do you deal with large unstructured data sets?
- Question: who sets the metadata structures in different communities?
- Question: where do we put our data?
- Problem: need coordination between different data repositories and related services.

Solutions proposed

- Solution: deposit your data into existing structured DBs where they are available.
- Solution: convince people to copy good data management plans (and follow them).
- Solution: create an open marketplace of good data management plans.
- Solution: make data management plans living documents.
- Solution: include the data scientist at the point of experimental design.

My comments

So, these topics: where do I put my data, and how does it interoperate? To make anything happen we have to look at things on a discipline-by-discipline level. Many disciplines have this nailed, and we must, must, must work as hard as we can to get appropriate data into its appropriate repository. If it goes anywhere else, that piece of data might as well not exist. I had a detailed conversation about the feasibility of federating computation over these kinds of data sets across repositories; at the moment there is no infrastructure or will to support an approach like that, but we don’t need it, because the primary home for that data exists.

For small-scale data (one-off files) that is important on a paper-by-paper basis, there are now stable DOI-coining repositories. If we can get people to use them, then the question of where this data should go seems to have a clear answer.
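
As a concrete sketch of what "coining a DOI" involves, here is roughly what a deposition to Zenodo looks like through its REST API. This is illustrative only: the field names follow Zenodo's documented deposition API, but the titles and names are invented, and the HTTP calls themselves are shown as comments since they need an access token.

```python
# Sketch: minting a DOI for a one-off data file via a repository such as
# Zenodo. The metadata fields below follow Zenodo's documented deposition
# API; check the current docs before relying on them.
import json

def deposition_metadata(title, creators, description):
    """Build the metadata payload for a deposition."""
    return {
        "metadata": {
            "upload_type": "dataset",
            "title": title,
            "creators": [{"name": c} for c in creators],
            "description": description,
        }
    }

payload = deposition_metadata(
    "Measurements for paper X",          # hypothetical title
    ["Doe, Jane"],                        # hypothetical author
    "Raw measurements underlying Figure 2.",
)

# The actual calls would look roughly like (requires an access token):
#   r = requests.post("https://zenodo.org/api/deposit/depositions",
#                     params={"access_token": TOKEN}, json=payload)
#   ...then upload the file and publish the deposition to mint the DOI.
print(json.dumps(payload, sort_keys=True))
```

The point is how little is required: a handful of metadata fields and one upload buys a stable, citable identifier for the file.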

One area that represents a significant challenge is domains that are just now stumbling into the era of unmanageably large data, owing to tooling or data sources suddenly producing more data than these domains had previously needed to work with. Domains that look at web-scale data, and domains whose experimental equipment has vastly increased its output volume, are examples, such as the digital humanities and microscopy. These research fields need to work hard on building up shared infrastructure and data formats, and those efforts need to be supported.

The other area that represents a challenge is data which is heterogeneous but needs to be integrated in order to tell a story. Before this workshop I’d not appreciated that this kind of complexity can live even within one experiment. Perhaps a research-objects-like approach, or an approach adopting tools and methods from the schema-less and key-value stores used in web applications, could be applied to these problems; I just don’t know enough to have a solid opinion on this yet.
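
The schema-less idea can be made concrete with a toy sketch: heterogeneous records share only a tiny common envelope (id, type, metadata) while the payload varies freely per discipline. The names here are purely illustrative, not from any standard.

```python
# Toy schema-less store: integration happens over a shared envelope,
# while each record's payload keeps its own discipline-specific shape.
store = {}

def put_record(record_id, record_type, payload, **metadata):
    store[record_id] = {
        "id": record_id,
        "type": record_type,
        "metadata": metadata,
        "payload": payload,   # arbitrary, discipline-specific structure
    }

put_record("r1", "microscopy-image",
           {"pixels": [[0, 1], [1, 0]], "stain": "DAPI"},
           instrument="confocal")
put_record("r2", "survey-response",
           {"answers": {"q1": "yes", "q2": 4}},
           language="en")

# Queries run over the envelope, not the heterogeneous payloads:
types = sorted(r["type"] for r in store.values())
print(types)
```

The trade-off is the usual one for schema-less designs: easy ingestion of anything, at the cost of pushing interpretation of the payload onto later curation.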

Incentives, trust and ethics

Problems or questions raised

- Question: how do you reward data dissemination? How do you provide incentives?
- Question: how do we deal with data fraud?
- Question: can you trust the data in a repository?
- Question: how might we provide a proper citation and reward system?
- Question: how do we make data donation habitual?

Solutions proposed

- Solution: give DOIs, or similar, to data.
- Solution: cite data in reference lists, use the [FORCE 11 data citation principles](https://www.force11.org/datacitation).
- Solution: reward data contributions.
- Solution: appropriately label datasets to support fine-grained attribution.
- Solution: develop a culture of acknowledgement.
- Solution: use embargoes as a mechanism to incentivise researchers to make timely use of their own data.
- Solution: give a prize for examples of good use of data (it's mentioned that there is a data prize in The Netherlands).
- Solution: provide certification for digital repositories.
- Solution: funders should mandate open data.
- Solution: enforce data policies.
- Solution: create a code of conduct teaching young researchers about the ethical issues around data.

My comments

A lot of people talked about how we can cite the data. Yo, HELLO!!, we already have a solution to this: you just cite the data. A functioning example of how a researcher had connected a data output from one paper to their ORCID profile was even demonstrated during the meeting. For the vast majority of use cases this is technically solved; we just need to let people know that it’s solved. Indeed, the following blog post from CrossRef describes how to intertwine data and literature citations.

There are, of course, some subtleties: what do you do with a data source or DB that has multiple contributors? What if your data source is evolving over time? There are potential solutions to these issues on the horizon, and I’m excited to see where the Dat project gets to. The aim of this project is to allow the creation of fine-grained identifiers (along with contributor info) for every record in a DB. It’s basically git for data.

On the broader topic of giving credit for data outputs where those outputs can be identified, that is where the cultural change needs to come, and ideas such as setting up prizes or named chairs for data reuse are really good ones. In fact looking over the proposed solutions, most are indeed carrot-like rather than stick-like.

On the topic of ethical responsibility and data fraud, I love the idea that making your data available is itself a huge disincentive to fraud. Reproducing experiments is hard, so even if your data is made available your experiment might not be that easy to reproduce; but fake data tends to have a statistically different signature from real data, and so the act of making your data available is an act of ethical responsibility.
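
One classic example of such a signature, offered here as a hedged illustration rather than a fraud detector: many real-world count and measurement datasets follow Benford's law for leading digits, while naively fabricated numbers tend not to. A toy check:

```python
# Benford's law says the leading digit d of many naturally occurring
# datasets appears with probability log10(1 + 1/d). Multiplicatively
# growing data is a textbook case; fabricated uniform digits would not fit.
import math
from collections import Counter

def leading_digit(x):
    s = str(abs(x)).lstrip("0.")
    return int(s[0])

def benford_expected(d):
    return math.log10(1 + 1 / d)

def digit_distribution(values):
    counts = Counter(leading_digit(v) for v in values if v != 0)
    n = sum(counts.values())
    return {d: counts.get(d, 0) / n for d in range(1, 10)}

# Exponential growth (here 7% per step) produces Benford-like digits.
data = [1.0 * 1.07 ** i for i in range(200)]
dist = digit_distribution(data)
print(round(dist[1], 2), round(benford_expected(1), 2))
```

Real forensic checks (digit tests, variance patterns, last-digit uniformity) are more involved, but the underlying point stands: faked numbers are surprisingly hard to make look natural.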

Legal and ownership issues

Problems or questions raised

- Question: how does the distance to the commercial market affect acceptance of and practices in data sharing?
- Question: can we introduce licences that can be interoperable for data?
- Question: who owns the data/a bacterium?
- Problem: legal and ethical issues affect the use of such data.

Solutions proposed

- Solution: move towards an internationally level playing field on ethics for research.
- Solution: update the EU copyright directive.
- Solution: create an EU-wide directive on data policy for scientific research.

My comments

It seems there are movements within the vast machine of EU legality to try to get to some state of normalisation on these issues, and I wish them the very best. One clear response from the floor of the conference around commercial involvement was the call not to give ownership of the data away to the private sector in the same way that ownership of the literature has been ceded to commercial publishers. In contrast, the fact that data currently held commercially can be of high value to research was mentioned eloquently by the speaker who is building climate change models, and I think that position strengthens the arguments of the Open Data movement – even commercial data providers should be encouraged EU-wide to think about getting open data certification for their data.

ERC data management workshop, day 2

Well, here we are at day two. My notes on the first day are here. We will open up with a short overview of the breakout sessions yesterday.

Life sciences breakout - key points.

The only point that came up that I hadn’t really covered in my notes from yesterday was the view that scientists should not become experts in data management, though some training would help.

Physical sciences breakout - key points.

  • open access to data shows the true richness of the data
  • can validate the ownership of data
  • can attract collaborators from other fields
  • advantages of data sharing outweigh the disadvantages
  • process of data sharing starts at the level of instrumentation and common data formats
  • there should be the possibility of DOI-type labelling of data packages
  • how do you deal with large unstructured data sets?
  • legal and ethical issues affect the use of such data
  • there is a difference between observational and experimental data sets
  • how does the distance to the commercial market affect acceptance of and practices in data sharing
  • who makes the first move? – researchers, institutions, funders, societies?
  • do we need a new profession of data curator?
  • appropriately label datasets to support fine-grained attribution
  • develop a culture of acknowledgement
  • provide funding for data sharing
  • Embargoes are complex; different embargoes are needed for different levels. PIs need some time to work with the data, but data collected at the national level should be made open immediately.
    • an embargo can act as an incentive for the timely use of the data for researchers (they need to get that paper out before their data is released)
  • set aside 15% of each grant for data curation and storage
  • that old chestnut “standardisation vs interoperability”
  • update the EU copyright directive
  • who sets the metadata structures in different communities?

Humanities breakout - key points.

DigiPal was mentioned.

  • management must be done at the discipline level, not at domain level
  • needs to be done above the institutional level
  • sustainability is crucial for SSH
  • could SSH learn how to deal with Ethical issues from the life sciences?
    • need flexible sciences
  • ownership of data is discipline dependent, one rule does not fit all
  • creation of infrastructures is not an ERC mandate (it makes one wonder why we might be here today)
  • need career recognition

### Open discussion on morning presentations.

Data management starts before the first data point is acquired.

Data and publications need to be tied together.

We need to get the right tools to researchers.

Representation of data is as important as data itself.

I remind researchers to cite data in their reference lists.

There is a discussion around whether raw data should be stored, or, if it’s possible to derive the data from code, whether that could be sufficient. It seems agreed that each community needs to decide this and find its own norms.

Roles and responsibilities around costs are one of the main issues that universities are currently discussing.

(Today I learnt about the Digital Curation Centre in the UK, I feel a little bad that I’d not totally been on top of that before).

There is a discussion on data journals and data articles. (I’m not entirely sure that this conversation gets us anywhere further than describing the world as we find it).

There is a discussion around funding, it’s asked whether data management and storage for research data represents a new market for the private sector. Strong reservations are expressed by multiple people, and the idea is compared to what has happened with scientific publications.

Breakout session on incentives.

Paul Ayris - Implementing the Future: the LERU roadmap for research data.

  • each university needs a research data management plan
  • researchers should have data management plans
  • LERU recognises that data should be open by default
  • rewards and incentives for researchers need further development

Excitingly the rectors of the universities that comprise the LERU group were very positive about adopting an open data policy.

The point in the roadmap about incentives for researchers has the optimistic view that there will be real economic benefit from opening up data early, and that will lead to the creation of more resources downstream that researchers can later benefit from.

A significant barrier is that data is not part of the way that research evaluation is done. Everything still hinges on the research article.

Not all journals require data to be deposited. Researchers are not going to deposit data out of the goodness of their hearts. There are few rewards for data sharing, even in the form of concrete rewards and prizes; no LERU universities have any such prize.

The recommendations on how to improve the situation include the common themes

- cite the data
- enforce data policies
- reward data contributions

Currently a good number of institutions have not developed a good research data policy, or data curation systems or policies. It’s not that it’s not important, it’s just too early in the process. Institutions are currently more involved with looking at open access; open data has just not made it to the top of the pile yet.

Most are planning to do something, they just haven’t started yet.

### Sünje Dallmeier‐Tiessen - Incentives for Open Science Attribution, Recognition, Collaboration.

Questions that come up from researchers

How do I find the data referenced in this paper?

This dataset is great! Has the author shared more?

Why should I bother to share my data? No one will see it anyway.

Sünje is working with DataCite and ORCID on ODIN, a way to link data, papers and people. This kind of infrastructure can help answer many of the questions that people have today about data.

Again The Data Citation principles are mentioned.

She gives a great example of how Kyle Cranmer uses his ORCID profile to show how he has contributed to data creation on the ATLAS experiment.

(It looks to me that this question of data citation is now well within the realm of having been technically solved, so we need to move to advocacy, and we need to teach researchers how to do this. The question of “how can I cite data” has a clear answer. Getting people to find out about the answer is the next challenge).

Veerle Van den Eynden and Libby Bishop - Incentives for sharing research data, evidence from an EU study.

They looked at case studies from a number of EU countries across a number of different disciplines. There are a diverse range of methods for data sharing. The report will be online next week and the interviews will go into their university repository and will also be available (Open Data FTW!!).

The incentives that these researchers identified were:

- direct benefit
    - collaborations are more robust
    - career visibility
    - get wiser
    - is better for science

- norms
    - default in the research group
    - hierarchical sharing throughout their research career
    - conservative non-sharing cultures represent a challenge
    - openness benefits research, but individual researchers reluctant to take lead

- external drivers
    - funders
    - data support services
    - publishers

These external drivers are not the main drivers, but they do help to shift the landscape.

The big fear remains being scooped. We need to create a level playing field for sharing. Sharing failed experiments was mentioned as being very important in biology and chemistry (but people still do not do this yet).

Data citation didn’t feel that they had to be able to track reuse of their data, but they were expecting citation for reuse.

Micro-publishing and micro-citation were mentioned as important, especially in the life sciences. You need to be able to provide atomic level identifiers.

The report and full recommendations will be available at http://knowledge-exchange.info.

Open discussion after breakout session.

It’s mentioned that there is an error in equating data publication with formal publication; it should be reported as a separate output. It’s also mentioned that in the humanities, when data is cited, the compilers of the data are currently not included in that citation. (I have to say that I think the commenter’s full comment is not inconsistent with the idea of actually including names in citations, even if they are not being used right now.)

Someone asks for a data repository with an embargo for the period of when a paper is under review. Sünje mentions that Zenodo can support this.

There is a very interesting discussion around aggregation of data versus the original collection of the data. A specific aggregation paper with about 40 authors is mentioned. The data that they aggregated were not in a state to be cited; they are not, at this point in time, citable. It’s put to one of the commenters that he could make a comment in the article on the journal platform to ask the authors to correctly cite the original data that they aggregated, and he said that he would be worried about making a comment like that, for fear of a negative impact on his future funding prospects.

I mention that research assessment needs to improve to seriously look at non-article contributions. I mention that researchers may need to look past the impact factor. There is an uncomfortable titter of polite laughter at the recommendation in the room, and we pass quickly over the point.

We do talk about the concrete steps that are out there to reward this kind of behaviour, and it turns out there are no institutions that formally recognise and reward these practices. That’s a bit of a red flag.

We ask what is the kind of reward that would make a difference. It’s thought that money would be counter-productive. Research money would be nice. Researchers want help to do their work. They want good services. If they can find people to work with who are professionals in managing data, that would be helpful.

Tim Hunt mentions that the ORCID interface is terrible. Work on that would be very valuable. “if you don’t make a good interface, you might as well not get out of bed”.

We talk about whether software should be usable, would that increase the uptake of good behaviour, but there is no conclusion from the group on this point.

We come back to the issue of what kind of a thing the data contribution is. Do we want databases to count as patents or publications, or simply as databases? Actually the point is more about what kind of IP we want for the data, which makes a lot of sense as a question. There is a strong call to make the data open. I have some thoughts on the differences between patents and papers. This also touches on the question of who owns the data.

Reporting session from working groups.

Data management and sharing.

- Issue: need coordination between different data repositories and related services

The key message is that a cultural change is needed when it comes to dealing with data.

Collection of personal data for scientific research is considered legitimate, subject to safeguards, under the view of EU data and privacy policies. They are moving towards a one-stop-shop model for these kinds of data use cases.

It is considered that data protection laws will not require additional resources from institutes (though that’s an opinion that flies in the face of common sense, so it will be interesting to see if it holds up).

Storage, curation and interoperability.

There was a speaker from Data Archiving and Networked Services. It was put that it would be good to

- provide certification for digital repositories  

A lot of technology is working now for managing data, but people don’t know about it, so we need to

- improve advocacy around existing solutions

Key points from this discussion were

- can you trust the data in a repository?

To get to that we need to understand the appropriate level of curation for the data. Metadata is critical. Scientific quality is the responsibility of both the researcher and the institute.

On fraud: who is responsible for it? If it’s found, who owns it?

How do you create a level playing field? It’s mentioned that the UK and the Netherlands are paying for repositories, but that might lead to less open access, as those bodies may decide at some point to no longer make their institutional repositories available to people outside of their institutions.

Data discoverability access and reuse.

- deposit your data into existing structured DBs where they are available

Elixir is mentioned in this talk.

There is a new copyright exception in the UK, but it is limited to non-commercial uses. New copyright exceptions are coming online, but in their current form they are not fully fit to support Big Data reuse.

There is a comment that the work Elsevier has done on article of the future, with creating in-article visualisations, involved some discussions around whether these visualisations would be subject to copyright, as they were a derivative work of the original article.

It was mentioned that we need to keep an eye on the emergence of new data types or new technologies. An eye needs to be kept on return on investment.

There is data that shows that an article that has associated data published will get cited more.

If we want open data, then we should also have open access.

When it comes to copyright infringement of machine copying, what should count is not that a copy is made, but the intent behind the copying.

Rewards and incentives for good data management (the carrot session).

I’ve written up this session earlier in this blog post, so I’m going to pass over the summing up of the session.

Breakout session - post summing - discussion.

There is a comment that we need to support the skills for interpreting the data in addition to the skills for creating data. Time for a quick coffee.

That discussion session was fairly low key, I think we have hit maximum overlap on the issues, and we are definitely recycling both issues, and proposed solutions. What the concluding discussion will bring we will now discover.

Concluding discussion session.

PLOS mention that they are going to automatically start to collect usage of data, and extend their ALM activity towards data use. They have an NSF grant to look at this. I understand that this program is called “making data count”.

Good data management is good science!

The carrot is a better approach than the stick. We need to listen to what scientists are telling us about how they see this situation, and we need to be responsive to that.

When talking about raw costs for infrastructure, the purchasing power of an institution or a funder is much bigger than that of an individual researcher. This points towards an idea where funders possibly ought to do bulk negotiation and distribute storage or compute credits to researchers, rather than raw funding. This is the approach that Phil Bourne is discussing with the NIH.

There is a discussion on costs. Storage is mentioned as being perhaps not a significant factor; compute and electricity are also mentioned. (I’ve done an estimate that by 2050 it will cost $1 to store an exabyte of data, but the truth here is that costs are highly domain specific, there is a wide distribution of use cases and levels of expertise amongst researchers, and raw storage costs are only one aspect of the issue.) I think that a general discussion on this topic is not as helpful as identifying specific issues or specific solutions.
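
For what it's worth, my estimate is back-of-envelope arithmetic of the following kind. The starting price and halving period are my assumptions, not numbers from the talk: roughly $30 per terabyte in 2015, with cost per byte halving every ~1.5 years (a Kryder's-law-style trend that may well not continue).

```python
# Back-of-envelope extrapolation for "$1 per exabyte by ~2050".
# Both inputs below are assumptions; the trend is not guaranteed to hold.
cost_per_tb_2015 = 30.0        # USD per terabyte in 2015 (assumed)
tb_per_exabyte = 1_000_000     # 1 EB = 10^6 TB
halving_years = 1.5            # assumed cost-halving period

cost_per_eb = cost_per_tb_2015 * tb_per_exabyte   # ~$3e7 per EB in 2015
year = 2015.0
while cost_per_eb > 1.0:       # halve until an exabyte costs under $1
    cost_per_eb /= 2
    year += halving_years

print(round(year, 1))
```

Roughly 25 halvings are needed, landing in the early 2050s; changing the halving period by even half a year moves the answer by more than a decade, which is exactly why the raw-storage number is the least interesting part of the cost discussion.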

The discussion on enforcement of policy is mentioned. The commission says that they want a bottom-up solution, but it is mentioned that a data management plan represents a contractual obligation. (It’s fairly well known that funders are very shy of brandishing sticks: it’s unpopular, and it could lead to unintended consequences. But when it comes to altering behaviour through financial incentives it’s hard to see options as powerful as penalties for not sharing data as laid out in data management plans, though given the underlying complexity of different research areas I would not want to be the one to pull that trigger.)

It’s mentioned that making papers, data and software open will give a benefit to industry and innovation.

We tiptoe over to the topic of open peer review. I’ll just tiptoe away from it right now, as it’s fairly off topic for this workshop.

Closing remarks

This has been a harmonious workshop. There is general agreement that we should have open access to research data, and we have many interested parties. We have a long way to go; we also have agreement that we need to change the culture at every level, and that we are possibly not moving fast enough. The difficulty of hiring and obtaining technical support has resonated, and has been mentioned several times (I’ll put in another shout out to http://software-carpentry.org).

Where does the data go? Who pays for it? Those are still big questions, and should be developed trans-nationally.

It’s mentioned that we need to identify specific repositories for specific disciplines. I would refine that: we already have very clear locations for specific kinds of data. What we need to identify are the fields that are struggling now, and in particular the fields at early risk of walking into a data avalanche, with no previous good examples of data care, because new tools have suddenly become available to them; microscopy, for example.

Issues and questions that came up today.

- how do you deal with large unstructured data sets?
- legal and ethical issues affect the use of such data
- how does the distance to the commercial market affect acceptance of and practices in data sharing
- who sets the metadata structures in different communities?
- can we introduce licences that can be interoperable for data?
- who pays, who is responsible for paying?
- Issue: need coordination between different data repositories and related services
- can you trust the data in a repository?

Suggested solutions to issues that came up today.

- give DOIs, or similar, to data
- move towards an internationally level playing field on ethics for research
- create a profession of data curators
- appropriately label datasets to support fine-grained attribution
- develop a culture of acknowledgement
- provide funding for data sharing
- use embargoes as a mechanism to incentivise researchers to make timely use of their own data
- take a percent, say 15%, and set that aside in every grant for data sharing, curation and storage
- update the EU copyright directive
- give a prize for examples of good use of data (it's mentioned that there is a data prize in The Netherlands)
- convince people to copy good data management plans (and follow them)
- cite data in reference lists, use the [FORCE 11 data citation principles](https://www.force11.org/datacitation)
- create an open marketplace of good data management plans
- data management plans should be a living document
- include the data scientist at the point of experimental design
	- (I'm reminded of a story from Janelia Farm ...)
- cite the data
- enforce data policies
- reward data contributions
- create an EU-wide directive on data policy for scientific research
- provide certification for digital repositories
- improve advocacy around existing solutions
- funders should mandate open data
- the EU should take care of infrastructure EU-wide to promote a level playing field
- create a code of conduct teaching young researchers about the ethical issues around data
- deposit your data into existing structured DBs where they are available
- do bulk purchasing from providers, and distribute compute and storage credits to researchers

ERC data management workshop, day 1

These are notes from the first day of the European Research Council Research Data Management & Sharing Workshop. I’ve also posted notes from the second day, and I’ll shortly add another post examining problems and potential solutions raised over the course of the workshop. Jennifer Lin from PLOS has also posted some excellent notes.

These notes are a bit jagged, but I thought there was more value in getting them out in rough form ahead of the RDS meeting that starts tomorrow than in waiting to get them into better shape and missing that event. My apologies up front for errors and incomplete sentences.

# Initial thoughts about the workshop.

The opening document, which was distributed a few days before the workshop, highlights the great heterogeneity in how data is used, understood and licensed across different disciplines. It’s a big old Gordian knot. I advocate doing small simple things that move us, step by step, into a better future.

I will be keeping an ear out for tools that are in use in real workflows, and I’ll be keeping an ear out for any comments that float up during the course of the meeting that resonate for one reason or another. In principle this meeting is about the EU listening to the research community, and other stakeholders, and hearing what it is that we want as an appropriate future for how data should be managed.

(I’ll see if the docs can be posted somewhere, for the purposes of this blog.)

Opening remarks.

In addition to the normal remarks on heterogeneity, Professor Nicholas Canny made the excellent point that business model viability is also a real issue within the EU. What works in one country does not always work in another.

The nub of Prof. Canny’s remarks is that sustainability is the key issue, not only in terms of storage, but also in terms of verification and validation of the data at a later point in time.

A use case of interest about digital preservation and sharing is the Boston archive of the IRA recollections. This was collected with the promise that no material would be released until later, but then… they were subpoenaed.

There are problems, and this conference is addressing those problems. The hope is that we can provide solutions to those problems.

The next speaker also focusses on publications, as the main route towards data. It’s mentioned that some disciplines have been very self-organising, however some disciplines are even lacking recognition that there is currently an issue. It would be interesting to me to find out which disciplines are lagging the most.

Sustainability also touches on the software, so it’s all about wares - hardware for storage, software for interoperability and wetware for expertise.

We are informed that the agenda is both heavy and efficient. I suppose like an elephant. It’s mentioned again that there is a hope that this workshop will provide solutions. I am doubtful that we will manage that, however if we can identify some small roadblocks, perhaps that might be sufficient, perhaps that will give us a few points against which we can apply some levers.

## Setting the scene.

(It might be good for me to try to capture specially interesting questions that emerge in these opening sessions, we can then review later and see if there are common themes.)

Sabrina Leonelli - the epistemology of data-intensive science.

(who is very very awesomely speaking with her very young baby on her shoulder, which is just awesome).

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; the structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

(On the question of how to update data, Dat is a very nice potential solution, as are data packages, but both are in an embryonic state. The reason Dat is appealing is that it matches the model of git, and we know that git is successful: git supports more items of code, sustainably, reusably and shareably, at a scale that currently dwarfs the number of researchers curating their own data sets. So if we had a system for data that was provably as robust as git, we could infer that such a system might be fit for purpose for the scholarly world.)
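
The git analogy can be made concrete. Git's core trick is content addressing: an object's identifier is the hash of its bytes, so identical content always gets the same id and any change yields a new one. Here is a toy version for data records; this is purely illustrative and not Dat's actual design.

```python
# Content-addressed storage in miniature: the id of a record is the
# SHA-256 hash of its canonical serialisation, as in git's object store.
import hashlib
import json

store = {}

def put(record):
    """Store a record under the hash of its canonical JSON form."""
    blob = json.dumps(record, sort_keys=True).encode()
    oid = hashlib.sha256(blob).hexdigest()
    store[oid] = blob
    return oid  # a stable, fine-grained identifier for this exact record

a = put({"sample": 1, "value": 3.2})
b = put({"sample": 1, "value": 3.2})   # identical content, identical id
c = put({"sample": 1, "value": 3.3})   # any change yields a new id

print(a == b, a == c)
```

This is what makes per-record identifiers cheap: nothing has to be registered centrally, because the identifier is derived from the data itself.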

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Sabrina makes the very interesting point that people are not very clear about what is being shared when they share their data, or what they get back in return. Sharing should be a reciprocal activity, and what we do with research data is not. She prefers the term dissemination.

Dr Hans Pfeiffenberger - Open Science – opportunities, challenges … @datasciencefeed.

Everyone is afraid of data publishing, but who should be afraid? The people who make data up should be afraid of sharing their data. (The bottom line here is that researchers with shit practices should be afraid of data sharing; the inverse inference is that if you are afraid of sharing your data you might be considered to be a shit researcher. However, that’s a stretch, and I want to be clear that Dr. Pfeiffenberger does not make this inference.)

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Mentions Argo as a great example of an open data project. (I’d not seen Argo before; it’s amazing.) Dr. Pfeiffenberger gives examples of where making the data open doubled the research output.
(The royal society report on science as an open enterprise is mentioned again).

NSF have started to ask for a list of five products from researchers, where the citable product can include a data set, and not only a research paper.

DFG rules have been amended to say that you can be an author of a paper if you contributed to the creation of some data.

A recommendation is made to not sign bad contracts. A recommendation is also made to fight against numerical assessment practices (this relates to DORA).

Bernd Pulverer - finding and accessing the data behind figures.

The data behind figures is critical, it’s currently hard to get to, EMBO is working on improving this situation.

Bernd emphasises that not all data is useful. Raw and unstructured data age rapidly.

Q: how do we deal with scooping?

(Here are some slides on the source data project.) A key thing they are working on is tagging and identifying information in papers at the figure-panel level, to identify methods, entities and authors of individual components of a figure. This will allow horizontal navigation based on data-rich and resource-rich facets.

Q: how do we deal with data fraud

Do we open the gates to heaven or to hell?

Dr Roar Skålin - Norwegian researchers want to share, but are afraid of jeopardising their career.

They surveyed researchers, and got responses from 1474 researchers, a response rate of just over 30%. This was statistically representative. A large number of researchers actively decided to opt out from the survey.

40 - 50% of researchers state that data is available, but on request.

Researchers in this system broadly reflect the concerns that we have seen from other studies: concerns about scooping, about misinterpretation, and about the time and effort required to share data.

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

Q: how is infrastructure organised, nationally, via publishers, via institutions, independent entities

Q: bottom up or top down?

Money is not the number one concern; there is more concern over infrastructure and training.

Most researchers are in agreement that data should be provided on publication.

### Summary of points from the scene setting.

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Point: who pays, what do they pay for?

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Q: how do we deal with data fraud

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

## Afternoon breakout session - Life Sciences.

Short comments

These are going to be quite short, and then we will kick into discussions, so I’m going to aim to only outline these comments.

### Iris Hovatta, Group Leader, Department of Biosciences, University of Helsinki.

She studies anxiety, and uses mice to study gene regulatory networks.

Most of their data is stored in Excel files. Standardisation seems to be challenging. Mice in different labs seem to behave differently, even if genetically identical.

They also produce RNA-seq data. This kind of data is well supported; many journals require its availability.

(On the Excel data, I wonder whether they have experimented with any plugins, and how they found them?)

They want to integrate their behavioural and expression data together. There is a lack of expertise in their lab for database construction. Hiring DBAs on a monthly basis is hard.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short term contracts by researchers

Q: how do we obtain informed consent?

### Dr Bouke de Jong, Head of Unit Mycobacteriology at the Institute of Tropical Medicine, Antwerp.

Studying TB, and TB transmission. They look at infection by comparing the genotypes of the bacteria from different patients within a study area. The genotypic clustering is a proxy for recency of infection.

Challenges again reside around complex and large data sets, combining demographic and genomic data.

Q: how do we combine heterogeneous data from within one discipline or study

They have non-automated data collection issues. Cleaning the data is labor intensive. They are creating dedicated DBs.

If doing it again, they would like to enter data in the field using barcode methods, to avoid double entry.

They have clustering at multiple levels.

(publishable unit is mentioned for the first time).

Q: where do we put our data?

Q: who owns the data/a bacterium?

In routine diagnostics patients have not given explicit consent. You might say that with anonymised data this might be OK, but the question of ownership of the bacteria remains an open question.

How does providing a service to the community like this (giving data away) count as a key performance indicator when it comes to evaluation?

Q: what's the incentive?

(Narrative seems to be an issue, if the researcher can construct that narrative within a paper and we can tightly link back to those narratives from the data, would that help?)

Jo McEntyre asks a question about why not deposit some of this interesting MD with the core “genetic data”.

We have a quick conversation around what we do with “unstructured” data at the point of publication. If we got that working, that might be a route to help with this issue, but it might lead to more noise. This needs to be worked out a bit more.

There is a question about anonymising the data

### Sebastian Luyssaert - user and provider perspectives on data sharing at the interface between life and earth sciences.

He looks at managing forestry, with a view to how that can affect greenhouse gases and climate change. They have a model with about 500k lines of code, which they run on a big computational infrastructure.

One of the data sets that they use is forest inventory data from the EU, with about 400k data points. This is economic data and is therefore hard to get access to. The data is held at the national level, so they need to contact 30 bodies; it is very labor intensive to get this data.

Problem: data is useless without the computational infrastructure behind the data

Solution: share data and operational algorithms (data cubes)

He mentions fluxnet as an example of a community effort. Data sharing is done via a fair-use policy, where the data is made available on request, but not all data are shared, so you still need to contact the PI to get the data.

Solution: make data sharing mandatory, like in [ICOS](http://www.icos-infrastructure.eu) & [NEON](http://www.neoninc.org/science/data)

He mentions that the old-school method has the advantage that it forces conversations between researchers, and collaborations emerge.

Where data comes from an environment where there is a lot of competition, it is better for you to share your data; if you don’t, people will work with the others who have similar data and are sharing it. If you have data from a location where data is hard to get, then it is better not to share, as people will contact you anyway, and you will get to be a co-author.

ESA recently realised that no-one was using their data. The reason is that NASA data was free and ESA data was expensive. ESA data is now free.

The data is so large that, to share it, they buy disks and post them around.

There is no way to talk to people at NASA or ESA about how the data was produced.

After they improve their models they become more of a software provider than a data provider. At this point he is struggling a lot with conversations in the

130 person-months will end up in one paper, in doing a large extension to the underlying software model. There is no tradition of sharing software; however, in the future, the only way people will be able to do these experiments is by using the software that this group created. They want to get credit for this work.

They are looking to distribute the model with a fair use policy. They have this software under version control, but how does this match up to IP? They are thinking of breaking the model up into components, and each component could come with a list of data and contributors to that component.

Q/Problem: how do you give credit for software?

He is worried that in the next year he is going to have to take up a lot of time in cleaning up his data and code, in order to be in a position of getting his next grant. He is more interested in doing science than spending time on cleaning his data sets.


We get into a really good conversation about tools, competency and training, Software Carpentry and Data Carpentry are mentioned.

We gather a list of potential issues with data in this domain. Someone mentions the issue of access to materials, or even accurate description of materials, such as reagents.

Data management plans are mentioned as potentially problematic in terms of overhead. The worry is that a data management plan is something people write at the start, but then ignore after getting the money. Thinking about them only as a step for submitting proposals could be a problem, and leaving them to the level of institutions can be problematic. Could one think of a system where the plan is discussed before and after submission, and the funder plays a role of coordinating data management at a super-institutional level? It’s mentioned that researchers are now being asked to review data management plans, but often don’t feel qualified to make these peer-review decisions. It’s mentioned that in the UK the humanities council has a special committee for reviewing technically heavy applications.

A question is raised about licensing of facts; it’s pointed out that you can’t license facts.

All of the UK funders have said that it is acceptable to ask for funding for data provisions, but you can’t ask for a blank cheque; you have to justify the request in the grant. This automatically interacts with the institutional level, as you will eventually end up interacting with whatever resources are available at your institution.

It’s mentioned that it’s a mistake to consider infrastructure as only hardware. The definition needs to expand to include skills.

The issue of rewards and incentives is mentioned. Bernd mentions that making data available can help with discoverability. Making data required at point of publication is mentioned as a mechanism (but to be honest the researchers do not seem convinced).

We ask the researchers what incentives they need to see to become more open to the idea of sharing, and we get a variety of answers:

- seeing that people who share are more successful
- knowing that a shared body of knowledge can provide more power in terms of making scientific advances (when I can see more data through the act of sharing my data)
- having already benefitted from sharing (80% good experience, 20% bad), but being mostly concerned that if he has to make all of his data available it will be too much of a burden, taking time away from doing science

(transmitting data is mentioned again, I wonder about sending computation to the data, rather than the other way around).

A comment is made that big data sets can be expensive to store, up to 30K for two years of storage. This can freeze out younger researchers. (Jo mentions again that we have places to put some data, but our systems do not cover all data types at this point in time.)

We devolve into writing a PowerPoint slide by committee.

### Summary of issues from breakout session.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Q: how do we obtain informed consent?

Q: how do we combine heterogeneous data from within one discipline or study

Q: where do we put our data?

Q: who owns the data/a bacterium?

Problem: data is useless without the computational infrastructure behind the data

Q/Problem: how do you give credit for software?

Problem: access and description of materials is often poor

### Summary of solutions from breakout sessions.

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short-term contracts by researchers

Solution: share data and operational algorithms (data cubes)

Solution: make data sharing mandatory, like in [ICOS](http://www.icos-infrastructure.eu) & [NEON](http://www.neoninc.org/science/data)

Solution: Docker or Vagrant

## Summary of issues from the workshop.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Point: who pays, what do they pay for?

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Q: how do we deal with data fraud

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

Q: how do we combine heterogeneous data from within one discipline or study

Q: where do we put our data?

Q: who owns the data/a bacterium?

Problem: data is useless without the computational infrastructure behind the data

## Summary of solutions proposed by the workshop.

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short term contracts by researchers

Solution: share data and operational algorithms (data cubes)

## Hallway conversations.

I chatted briefly with an engineer over coffee. She described some of the data that they deal with when looking at modelling the potential effects of building tunnels under a city, and that effect on the buildings on the ground.

## Bingo card terms

Lots of topics come up again and again, so I’ve quickly created a small set of data sharing bingo cards. I’ve used the following terms:

- who pays
- relation to open access?
- royal society report
- privacy concerns
- the humanities are different
- data standards
- embargoes
- how do we cite data
- data quality
- big data
- unreproducible science
- legal restrictions
- licensing
- I’ll get scooped
- no time to share
- my data will be misunderstood
- there is no infrastructure
- my data is sensitive
- bottom up
- top down
- sustainability
- PLOS
- discoverability
- incentives
- data citation
- publishable unit
- supercomputer
- anonymise the data

Some pitfalls in using IPython to generate talk slides

Yesterday I gave a talk using an IPython notebook to generate the talk slides. You can get the notebook on GitHub, and you can see a live version of the slides.

It succeeded in generating an artefact that was somewhat literate, in that the code and documentation are nicely woven together, so anyone who has the time can get to exactly the same point that I got to with this repo. However, I ran into a couple of problems that make me feel that this is not yet ready for mainstream use, specifically:

- While I was waiting for my talk to begin I had loaded the slides in Chrome. The Chrome window crashed just before I started the slideshow, and I had to switch over to using Safari right at the last moment.
- I was using `from IPython.display import HTML` to show screenshots, and none of the screenshots showed up during the presentation.
- I didn’t figure out how to hide cell input on slides where I would have preferred to only show the cell output. I’ve since found a post that describes how to do this, but it was too late for me.
- I’m used to laying out concepts using shapes in Keynote; there is nothing equivalent to that in this stack.
- I failed to correctly convert my slides to PDF. I followed the docs from reveal.js, but it just didn’t work for me.

This system is probably a step up from using LaTeX to create slides, but I don’t think it’s ready for mass-market use yet. I had more success with this than with running a presentation from Evernote, which I tried earlier in the year, but I’m unlikely to use this again in the near future.

I think if you were creating a very code-tutorial driven presentation then this would be a reasonable tool to consider using.

shortcuts that I use for the git command line

I use git a lot. It’s pretty complicated, and it has a lot of command line options that I can never remember. I’ve copied a couple of shortcuts from the web, and here are two that I use a lot, presented as fish shell functions.

```fish
function gl
    git log --graph --abbrev-commit --decorate --date=relative --format=format:'%C(bold blue)%h%C(reset) - %C(bold green)(%ar)%C(reset) %C(white)%s%C(reset) %C(dim white)- %an%C(reset)%C(bold yellow)%d%C(reset)' --all
end
```

This provides a nicer view of the git log, with the branching tree also displayed.

```fish
function branches
    git for-each-ref --sort=-committerdate refs/heads/
    git for-each-ref --sort=-committerdate refs/heads/ --format='%(refname) %(committerdate) %(authorname)' | sed 's/refs\/heads\///g'
end
```

This shows two lists of the current branches in the repo, in reverse chronological order. The first list includes the full sha of each branch tip, and the second shows each branch name with its committer date and author name. This is useful when coming back to a repo that has a few branches, to help you get an overview of the activity that’s happened in the different branches.
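If you don’t use fish, roughly the same shortcuts can be set up as plain git aliases. This is a sketch: the alias names (`lg` and `recent`) are my own choice, and I’ve dropped the colour formatting from the log view for brevity.

```shell
# Register the two shortcuts as git aliases (works from any shell).
git config --global alias.lg "log --graph --abbrev-commit --decorate --date=relative --all"
git config --global alias.recent "for-each-ref --sort=-committerdate refs/heads/ --format='%(refname:short) %(committerdate) %(authorname)'"
```

After that, `git lg` and `git recent` give similar output to the functions above; note that `%(refname:short)` does the same job as the `sed` call in the fish version.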

I’ve put these examples into a gist.

looking for ideas for our Wikimania talk on open scholarship tools

Somewhat inspired by the awesome http://sciencetoolbox.org, Martin Fenner and I proposed a session for the upcoming Wikimania conference. We will be talking about Open Scholarship Tools on Sunday the 10th of August at 9:30 am. In our outline for the talk we are thinking of possibly covering:

- CrossRef API (and possibly also the DataCite and ORCID APIs)
- Pandoc
- RStudio
- Zotero
- IPython Notebook
- Plotly
- Datawrapper

What do you think we should try to cover in our 30 minute slot?