thoughts on the ERC data workshop
Wed Sep 24, 2014
On Thursday and Friday of last week I attended a European Research Council workshop on managing research data. It was well attended, with about 130 participants bringing views from across the academic disciplines. I’ve blogged my raw notes from day one and day two. In this post I reflect on the points that I noticed being raised over the two days.
People have been talking about the increasing importance of research information for many years now, and a hope was raised in the opening comments that, by the end of the workshop, we might be able to provide solutions to the problems posed by research data. I was skeptical about our chances of doing that. The risk at a meeting like this is that the same points and problems get regurgitated, problems are listed at too high a level, and everyone calls on everyone else, or at least someone else, to step in and solve the problem. There were aspects of all of these issues, but there were also highly encouraging signs, including signs of real progress in solving some of the perennial existential questions of research data. Over the course of the two days I made a note, whenever I noticed one, of each specific named issue, potential solution, or novel point.
By the end of the first day the problems list far outweighed the solutions list, but by the end of the second day that ratio had reversed. I’m going to briefly drill into each one in a moment, but before doing that I’ll touch on the highlights coming out of the meeting.
By the end of the meeting the chair put it well when he said that overall the feeling coming out of the meeting was one of unity, a shared desire and understanding that data should be open, and a shared understanding that some culture change is necessary. We have many parties interested in this issue, and we all want to move faster on the issue.
There were signs of real progress too. LERU have a working paper on research data, and the take home message is that university chancellors almost universally think that research data should be made open, and that this will be a high priority issue for them - once they figure out what they are doing about open access.
How to cite data is now solved, in principle. The FORCE 11 data citation principles solve this; what remains is implementation (already in progress in the life sciences), and then adoption. Adoption is going to be where the largest challenges lie, because if we have a mechanism for citing data, and researchers continue to turn up to meetings like this and say "how do I cite data?", then obviously there is work to do. We have to continue this work until researchers turn up to meetings like this one and say "this is how I cite data". We want data citations everywhere.
A working solution to how researchers can make claims on what data they have produced was demonstrated by Sünje Dallmeier‐Tiessen with the ODIN project. Again there is work needed here to promote adoption, and work to do on usability and interoperability.
It wasn’t all light and harmonious music though, there were a few telling shadows, a few indicators that the problem remains a deep and challenging one. It was notable that no LERU university has any reward system or prize system in place for good use or reuse of research data, or any mechanism in place for rewarding excellence in the support systems for research data. There is a Dutch prize on this topic, but it’s clear that more can be done.
In fact, the need for culture change was mentioned often. It should be obvious where this change can best be effected: in the grant awarding process and in the hiring process. The EU, indeed all funders, are wary of sticks, but let’s sow the fields of Europe’s rich plots of data with an abundance of carrots. Let’s make available specific funding to support bottom-up approaches to training for data management. There is already an appetite, with initiatives like Software Carpentry, the creation of figshare, and the growth of Data Dryad. Goodness, we could even invest in library infrastructure for this purpose. Let’s set up a research track for pure data re-use, with grants awarded to those who have projects that reuse the data of others, and give them the time and resources to clean up that data. Let’s make clear that data are a real research output that counts in assessment. There would be no requirement to do this, but researchers who did would have their work recognised where it matters most. There were a few calls that anywhere between 5% and 15% of all research funding should go to data management, but I think it would be better to look at how we can alter behaviours on the ground from the bottom up. Data is important. After all, the data is the science, or at the very least it is the embodiment of our articulation of how we have grappled with reality, and it is the trail that shows our direct engagement with nature.
Having real options for data management careers in research could also help in the short term, and in the medium term could help create a workforce that is skilled in the management of big data.
OK, so let’s now look at each of the points that I captured from the two days at the meeting. I’ll list these as either questions, problems or solutions. I’m grouping them into topics that seem to make sense to me, so my groupings don’t reflect the order in which these topics arose, but I hope by doing this I can provide a horizontal view across the breakout sessions from the meeting to get the common themes that emerged. I’ll list the solutions as they were proposed. They stand here for your consideration. I add my own commentary at the bottom of each section.
Moolah, money, cash
Problems or questions raised
- Question: who pays, what do they pay for?
- Solution: provide funding for data sharing.
- Solution: take a percent, say 15%, and set that aside in every grant for data sharing, curation and storage.
- Solution: do bulk purchasing from providers, and distribute compute and storage credits to researchers.
The inference seemed to be that it should mostly be funders supplying the cash through some mechanism. The idea of doing bulk purchasing for infrastructure, and then giving researchers credits, is an appealing one. Such approaches will be good for big data, but will have little impact on the majority of data that is created: things like individual Excel files on an individual’s computer.
Infrastructure and support
Problems or questions raised
- Problem: hiring domain experts, e.g. DBAs, on a temporary basis is hard.
- Problem: formats need to be updated, data needs to work in the long term and needs intelligent curation, and structures to support this do not exist.
- Problem: manual labour is required, and the current credit system does not support that.
- Question: how might we provide more training?
- Problem: data is useless without the computational infrastructure behind the data.
- Question: how might we provide better infrastructure for data?
- Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short-term contracts by researchers.
- Solution: create a profession of data curators.
- Solution: the EU should take care of infrastructure EU-wide to promote a level playing field.
All of these solutions are good in principle, however they will require real will to create these kinds of incentives. It was often mentioned throughout the meeting that these kinds of skills could be provided through the private sector, and there was real concern that such an approach might lead to restrictions on the data if that data becomes controlled by a private company. Academic publishers were mentioned. I find it hard to see an EU-wide rolling out of an army of data curators, I think that has to come bottom up, from within disciplines. I could see libraries making a case to equip themselves for this task, but I don’t see them as being the natural inheritors of that task. It seems that institution-wide facilities, or national facilities might be good places for these kinds of roles to reside.
Problems or questions raised
- Problem: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive.
- Question: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication?
- Question: how might we define data, per discipline?
No solutions were proposed for this topic, but a great point was made that what we are calling data sharing is effectively data dissemination, as those making their data available are not usually waiting for some reciprocal piece of data (although to be fair, reciprocity was the example given for the motivation behind sharing genomic data).
Management and interoperability
Problems or questions raised
- Question: how do we combine heterogeneous data from within one discipline or study?
- Question: how do you deal with large unstructured data sets?
- Question: who sets the metadata structures in different communities?
- Question: where do we put our data?
- Problem: need coordination between different data repositories and related services.
- Solution: deposit your data into existing structured DBs where they are available.
- Solution: convince people to copy good data management plans (and follow them).
- Solution: create an open marketplace of good data management plans.
- Solution: make data management plans be a living document.
- Solution: include the data scientist at the point of experimental design.
So, on to these topics: where do I put my data, and how does it interoperate? To make anything happen we have to look at things on a discipline by discipline level. Many disciplines have this nailed, and we must must must must must work as hard as we can to get appropriate data into its appropriate repository. If it goes anywhere else that piece of data might as well not exist. I had a detailed conversation about the feasibility of federating computation over these kinds of data sets across repositories. At the moment there is no infrastructure or will to support an approach like that. However, we don’t have to build it, because the primary home for that data already exists.
For small-scale, one-off files (data that is important on a paper by paper basis) there are now stable DOI-coining repositories. If we can get people to use them then the question of where this data should go seems to have a clear answer.
Two areas represent significant challenges. The first is domains that are just now stumbling into the era of unmanageably large data, owing to tooling or data sources suddenly being able to produce more data than these domains had previously needed to work with. Domains that look at web-scale data, and domains whose experimental equipment has vastly increased its output volume, are examples, such as the digital humanities and microscopy. These research fields need to work hard on building up shared infrastructure and data formats, and those efforts need to be supported.
The other area that represents a challenge is data which is heterogeneous, but needs to be integrated in order to tell a story. Before this workshop I’d not appreciated that this kind of complexity can live even within one experiment. Perhaps a Research Objects-like approach, or an approach of adopting tools and methods from schema-less data stores and key-value stores in web applications, could be applied to these problems. I just don’t know enough to have a solid opinion on this yet.
Incentives, trust and ethics
Problems or questions raised
- Question: how do you reward data dissemination? How do you provide incentives?
- Question: how do we deal with data fraud?
- Question: can you trust the data in a repository?
- Question: how might we provide a proper citation and reward system?
- Question: how do we make data donation habitual?
- Solution: give DOIs, or similar, to data.
- Solution: cite data in reference lists, use the [FORCE 11 data citation principles](https://www.force11.org/datacitation).
- Solution: reward data contributions.
- Solution: appropriately label datasets to support fine-grained attribution.
- Solution: develop a culture of acknowledgement.
- Solution: use embargoes as a mechanism to incentivise researchers to make timely use of their own data.
- Solution: give a prize for examples of good use of data (it was mentioned that there is a data prize in The Netherlands).
- Solution: provide certification for digital repositories.
- Solution: funders should mandate open data.
- Solution: enforce data policies.
- Solution: create a code of conduct teaching young researchers about the ethical issues around data.
A lot of people talked about how we can cite the data. Yo, HELLO!!, we already have a solution to this, you just have to cite the data. A functioning example of how a researcher had connected a data output from one paper to their ORCID profile was even demonstrated during the meeting. For the vast majority of use cases, this is technically solved, we just need to let people know that it’s solved. Indeed, a blog post from CrossRef describes how to intertwine data and literature citations.
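To make "you just have to cite the data" concrete, here is a minimal sketch of what a data citation in a reference list might look like. The field set (creators, year, title, repository, version, persistent identifier) and the layout are my assumptions about a commonly used form, not a format mandated by the principles, and the names and DOI in the example are made up.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DatasetCitation:
    """Illustrative container for the elements of a data citation."""
    creators: List[str]
    year: int
    title: str
    repository: str
    doi: str
    version: Optional[str] = None

    def format(self) -> str:
        """Render the citation as a single reference-list entry."""
        authors = "; ".join(self.creators)
        # The version is optional; include it only when one is given.
        version = f" Version {self.version}." if self.version else ""
        return (
            f"{authors} ({self.year}). {self.title}. "
            f"{self.repository}.{version} https://doi.org/{self.doi}"
        )


# Hypothetical dataset: every value below is a placeholder.
example = DatasetCitation(
    creators=["Doe, J.", "Roe, R."],
    year=2014,
    title="Example survey responses",
    repository="Example Repository",
    doi="10.1234/example.5678",
    version="2",
)
print(example.format())
```

The point is that the entry sits in the reference list alongside the literature citations and carries a resolvable identifier, so it can be counted and credited like any other citation.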
There are, of course, some subtleties: what do you do with a data source or DB that has multiple contributors, and what if your data source is evolving over time? There are potential solutions to these issues on the horizon, and I’m excited to see where the DAT project gets to. The aim of this project is to allow the creation of fine-grained identifiers (along with contributor info) for every record in a DB. It’s basically git for data.
On the broader topic of giving credit for data outputs where those outputs can be identified, that is where the cultural change needs to come, and ideas such as setting up prizes or named chairs for data reuse are really good ones. In fact looking over the proposed solutions, most are indeed carrot-like rather than stick-like.
On the topic of ethical responsibility and data fraud, I love the idea that making your data available is a huge disincentive to fraud. Reproducing experiments is hard, so even if your data is made available, your experiment might not be that easy to reproduce. But fake data tends to have a statistically significantly different signature to real data, and so the act of making your data available is an act of ethical responsibility.
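As one illustration of what a "statistical signature" check could look like, here is a rough sketch of a first-digit test against Benford's law. This is my own choice of example, not something presented at the workshop. It only makes sense for data that spans several orders of magnitude, it assumes finite numeric values, and a large statistic is a prompt for a closer look, never proof of fraud.

```python
import math
from collections import Counter
from typing import Iterable


def first_digit(value: float) -> int:
    """Return the leading non-zero digit of a finite, non-zero number."""
    # Scientific notation puts the leading digit first, e.g. '3.14...e+02'.
    return int(f"{abs(value):.15e}"[0])


def benford_chi_square(values: Iterable[float]) -> float:
    """Chi-square statistic of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        # Benford's law: P(first digit = d) = log10(1 + 1/d).
        expected = n * math.log10(1 + 1 / d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat


# Powers of two are a textbook Benford-like sequence; a narrow range is not.
benford_like = [2 ** k for k in range(1, 200)]
narrow_range = list(range(500, 600))
print(benford_chi_square(benford_like))
print(benford_chi_square(narrow_range))
```

With 8 degrees of freedom, a statistic above roughly 15.5 would be unusual at the 5% level if the data really followed Benford's law. None of this is possible, of course, unless the data is available to be checked in the first place.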
Legal and commercial issues
Problems or questions raised
- Question: how does the distance to the commercial market affect acceptance of and practices in data sharing?
- Question: can we introduce licences that can be interoperable for data?
- Question: who owns the data/a bacterium?
- Question: legal and ethical issues affect the use of such data.
- Solution: move towards an internationally level playing field on ethics for research.
- Solution: update the EU copyright directive.
- Solution: create an EU-wide directive on data policy for scientific research.
It seems there are movements within the vast machine of EU legality to try to get to some state of normalisation on these issues, and I wish them the very best. One clear response from the floor of the conference around commercial involvement was the call to not give ownership of the data away to the private sector in the same way that ownership of the literature has been ceded to commercial publishers. In contrast to this, the fact that data that is currently commercially held can be of high value to research was mentioned eloquently by the speaker who is building climate change models. I think that position strengthens the arguments of the Open Data movement: even commercial data providers should be encouraged EU-wide to move to thinking about getting open data certification for their data.