ERC data management workshop, day 1

Sun Sep 21, 2014

Tags: data, open-access, EU, ERC, data management, publishing

- Initial thoughts about the workshop.
- Opening remarks.
- Setting the scene.
  - Sabrina Leonelli - the epistemology of data-intensive science.
  - Dr Hans Pfeiffenberger - Open Science – opportunities, challenges … @datasciencefeed.
  - Bernd Pulverer - finding and accessing the data behind figures.
  - Dr Roar Skålin - Norwegian researchers want to share, but are afraid of jeopardising their career.
  - Summary of points from the scene setting.
- Afternoon breakout session - Life Sciences.
- Summary of issues from the workshop.
- Summary of solutions proposed by the workshop.
- Hallway conversations.
- Bingo card terms

These are notes from the first day of the European Research Council Research Data Management & Sharing Workshop. I’ve also posted notes from the second day, and I’ll shortly add another post examining problems and potential solutions raised over the course of the workshop. Jennifer Lin from PLOS has also posted some excellent notes.

These notes are a bit jagged, but I thought there was more value in getting them out in rough form ahead of the RDS meeting that starts tomorrow than in waiting to get them into better shape and missing that event. My apologies up front for errors and incomplete sentences.

## Initial thoughts about the workshop.

The opening document, which was distributed a few days before the workshop, highlights the great heterogeneity in how data is used, understood and licensed across different disciplines. It’s a big old Gordian knot. I advocate doing small simple things that move us, step by step, into a better future.

I will be keeping an ear out for tools that are in use in real workflows, and I’ll be keeping an ear out for any comments that float up during the course of the meeting that resonate for one reason or another. In principle this meeting is about the EU listening to the research community, and other stakeholders, and hearing what it is that we want as an appropriate future for how data should be managed.

(I’ll see if the docs can be posted somewhere, for the purposes of this blog.)

## Opening remarks.

In addition to the normal remarks on heterogeneity, Professor Nicholas Canny made the excellent point that business model viability is also a real issue within the EU. What works in one country does not always work in another.

The nub of Prof. Canny’s remarks is that sustainability is the key issue, not only in terms of storage, but also in terms of verification and validation of the data at a later point in time.

A use case of interest about digital preservation and sharing is the Boston archive of the IRA recollections. This was collected with the promise that no material would be released until later, but then …. they were subpoenaed.

There are problems, and this conference is addressing those problems. The hope is that we can provide solutions to those problems.

The next speaker also focusses on publications, as the main route towards data. It’s mentioned that some disciplines have been very self-organising, however some disciplines are even lacking recognition that there is currently an issue. It would be interesting to me to find out which disciplines are lagging the most.

Sustainability also touches on the software, so it’s all about wares - hardware for storage, software for interoperability and wetware for expertise.

We are informed that the agenda is both heavy and efficient. I suppose like an elephant. It’s mentioned again that there is a hope that this workshop will provide solutions. I am doubtful that we will manage that, however if we can identify some small roadblocks, perhaps that might be sufficient, perhaps that will give us a few points against which we can apply some levers.

## Setting the scene.

(It might be good for me to try to capture especially interesting questions that emerge in these opening sessions; we can then review them later and see if there are common themes.)

### Sabrina Leonelli - the epistemology of data-intensive science.

(who is very very awesomely speaking with her very young baby on her shoulder, which is just awesome).

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

(On the question of how to update data, Dat is a very nice potential solution, as are data packages, but both are in an embryonic state. Dat is appealing because it matches the model of git, and we know that git is successful: it supports code in a way that is sustainable, reusable and shareable, at a scale that currently dwarfs the number of researchers curating their own data sets. If we had a system that could do the same for data, and that was provably as robust as git, we could infer that such a system might be fit for purpose for the scholarly world.)
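To make the git comparison a little more concrete, here is a minimal sketch of content addressing, the idea git relies on: each version of a file is identified by a hash of its contents. This is an illustration only and not how Dat is implemented. The names `snapshot_id` and `record_version` are hypothetical, the in-memory dictionary stands in for a real store, SHA-256 is my choice of hash, and the CSV rows are made-up placeholder data.

```python
import hashlib


def snapshot_id(data: bytes) -> str:
    """Return a hex digest that identifies this exact content."""
    return hashlib.sha256(data).hexdigest()


def record_version(history: list, store: dict, data: bytes) -> str:
    """Store the content under its digest and append the digest to the history."""
    digest = snapshot_id(data)
    store.setdefault(digest, data)  # identical content is stored only once
    if not history or history[-1] != digest:
        history.append(digest)  # skip consecutive no-op saves
    return digest


# Placeholder data, invented for illustration.
store = {}
history = []
v1 = record_version(history, store, b"id,value\n1,0.4\n")
v2 = record_version(history, store, b"id,value\n1,0.4\n2,0.7\n")
v1_again = record_version(history, store, b"id,value\n1,0.4\n")
```

Reverting to the first version yields the same identifier as before, so anyone holding that identifier can check that they have exactly the same data. That property is a large part of what makes the git model attractive for shared data sets.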

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Sabrina makes the very interesting point that people are not very clear about what is being shared when they share their data, or about what they get back in return. Sharing should be a reciprocal activity, and what we do with research data is not a reciprocal activity. She prefers the term dissemination.

### Dr Hans Pfeiffenberger - Open Science – opportunities, challenges … @datasciencefeed.

Everyone is afraid of data publishing, but who should be afraid? The people who make data up should be afraid of sharing their data. (The bottom line here is that researchers with shit practices should be afraid of data sharing. The inverse inference is that if you are afraid of sharing your data you might be considered a shit researcher; however, that’s a stretch, and I want to be clear that Dr. Pfeiffenberger does not make this inference.)

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Mentions Argo as a great example of an open data project. (I’d not seen Argo before, it’s amazing.) Dr. Pfeiffenberger gives examples of where making the data open has doubled the research output.

(The Royal Society report on science as an open enterprise is mentioned again.)

NSF have started to ask for a list of five products from researchers, where the citable product can include a data set, and not only a research paper.

DFG rules have been amended to say that you can be an author of a paper if you contributed to the creation of some data.

A recommendation is made to not sign bad contracts. A recommendation is also made to fight against numerical assessment practices (this relates to DORA).

### Bernd Pulverer - finding and accessing the data behind figures.

The data behind figures is critical but currently hard to get to; EMBO is working on improving this situation.

Bernd emphasises that not all data is useful. Raw and unstructured data age rapidly.

Q: how do we deal with scooping?

(here are some slides on the source data project). A key thing they are working on is tagging and identifying information in papers at the figure panel level to identify methods, entities and authors of individual components of a figure. This will allow horizontal navigation based on data rich and resource rich facets.

Q: how do we deal with data fraud

Do we open the gates to heaven or to hell?

They surveyed researchers, and got responses from 1474 researchers, a response rate of just over 30%. This was statistically representative. A large number of researchers actively decided to opt out from the survey.

40 - 50% of researchers state that data is available, but on request.

Researchers in this system broadly reflect the concerns that we have seen from other studies, concerns about scooping, about misinterpretation, and the time and effort required to

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

Q: how is infrastructure organised, nationally, via publishers, via institutions, independent entities

Q: bottom up or top down?

Money is not the number one concern, more concern over infrastructure and training.

Most researchers are in agreement that data should be provided on publication.

### Summary of points from the scene setting.

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Point: who pays, what do they pay for?

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Q: how do we deal with data fraud

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

## Afternoon breakout session - Life Sciences.

Short comments

These are going to be quite short, and then we will kick into discussions, so I’m going to aim to only outline these comments.

### Iris Hovatta, Group Leader, Department of Biosciences, University of Helsinki.

She studies anxiety, and uses mice to study gene regulatory networks.

Most of their data is stored in Excel files. Standardisation seems to be challenging. Mice in different labs seem to behave differently, even if genetically identical.

They also produce RNA-seq data. This kind of data is well supported, and many journals require its availability.

(On the Excel data, I wonder whether they have experimented with any plugins, and how they found them.)

They want to integrate their behavioural and expression data together. There is a lack of expertise in their lab for database construction. Hiring DBAs on a monthly basis is hard.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short term contracts by researchers

Q: how do we obtain informed consent?

### Dr Bouke de Jong, Head of Unit Mycobacteriology at the Institute of Tropical Medicine, Antwerp.

Studying TB, and TB transmission. They look at infection by comparing the genotypes of the bacteria from different patients within a study area. The genotypic clustering is a proxy for recency of infection.

Challenges again reside around complex and large data sets, combining demographic and genomic data.

Q: how do we combine heterogeneous data from within one discipline or study

They have non-automated data collection issues. Cleaning the data is labor intensive. They are creating dedicated DBs.

If doing this again they would like to enter data in the field using bar-code methods to avoid double entry.

They have clustering at multiple levels.

(publishable unit is mentioned for the first time).

Q: where do we put our data?

Q: who owns the data/a bacterium?

In routine diagnostics patients have not given explicit consent. You might say that with anonymised data this might be OK, but the question of ownership of the bacteria remains an open question.

How does providing a service to the community like this (giving data away) count as a key performance indicator when it comes to evaluation?

Q: what's the incentive?

(Narrative seems to be an issue, if the researcher can construct that narrative within a paper and we can tightly link back to those narratives from the data, would that help?)

Jo McEntyre asks a question about why not deposit some of this interesting MD with the core “genetic data”.

We have a quick conversation around what we do with “unstructured” data at the point of publication. If we got that working, that might be a route to help with this issue, but it might lead to more noise. This needs to be worked out a bit more.

There is a question about anonymising the data

### Sebastian Luyssaert - user and provider perspectives on data sharing at the interface between life and earth sciences.

He looks at managing forestry, with a view of looking at how that can affect greenhouse gases and climate change. They have a model with about 500k lines of code. They run this model on a big computational infrastructure.

One of the data sets that they use is forest inventory data from the EU. There are about 400k data points. This is economic data and is therefore hard to get access to. This data is held at the national level. They need to contact 30 bodies. It is very labor intensive to get this data.

Problem: data is useless without the computational infrastructure behind the data

Solution: share data and operational algorithms (data cubes)

He mentions FLUXNET as an example of a community effort. Data sharing is done via a fair-use policy, where the data is made available on request, but not all data are shared, so you still need to contact the PI to get the data.

Solution: make data sharing mandatory, like in [ICOS](http://www.icos-infrastructure.eu) & [NEON](http://www.neoninc.org/science/data)

He mentions that the old-school method has the advantage that it forces conversations between researchers, and collaborations emerge.

Where data is taken from an environment where there is a lot of competition, it is better for you to share your data: if you don’t, people will work with others who have similar data and are sharing it. If you have data from a location where data is hard to get, then it is better not to share your data, as people will contact you anyway, and you will get to be a co-author.

ESA recently realised that no-one was using their data. The reason is that NASA data was free and ESA data was expensive. ESA data is now free.

Data is so large that to do data sharing they buy disks and post it around.

There is no way to talk to people at NASA or ESA about how the data was produced.

After they improve their models they become more of a software provider than a data provider. At this point he is struggling a lot with conversations in the

130 person months will end up in one paper, in doing a large extension to the underlying software model. There is no tradition in sharing software, however the only way people will be able to do experiments in the future is that they will be able to use the software that this group created. They want to get credit for this work.

They are looking to distribute the model with a fair use policy. They have this software under version control, but how does this match up to IP? They are thinking of breaking the model up into components, and each component could come with a list of data and contributors to that component.
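One way to picture the per-component idea is a small manifest that records, for each component, the data sets it depends on and the people who contributed to it. The sketch below is my own illustration, not the group's design. The `Component` class and the `credits` function are hypothetical, and the component, data set and contributor names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One piece of the model, with the data and people behind it."""
    name: str
    datasets: list = field(default_factory=list)
    contributors: list = field(default_factory=list)


def credits(components):
    """Map each contributor to the sorted list of components they worked on."""
    result = {}
    for component in components:
        for person in component.contributors:
            result.setdefault(person, []).append(component.name)
    return {person: sorted(names) for person, names in result.items()}


# Placeholder names only; none of these come from the real model.
model = [
    Component("canopy", ["inventory_a"], ["Researcher A", "Researcher B"]),
    Component("soil", ["inventory_b"], ["Researcher B"]),
]
component_credits = credits(model)
```

A manifest like this could sit next to the code under version control, so that the credit list is updated in the same commit as the component it describes.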

Q/Problem how do you give credit for software

He is worried that in the next year he is going to have to take up a lot of time in cleaning up his data and code, in order to be in a position of getting his next grant. He is more interested in doing science than spending time on cleaning his data sets.

### Discussion

We get into a really good conversation about tools, competency and training; Software Carpentry and Data Carpentry are mentioned.

We gather a list of potential issues with data in this domain. Someone mentions the issue of access to materials, or even accurate description of materials, such as reagents.

Data management plans are mentioned as potentially problematic in terms of overhead. The worry is that a data management plan is something people write at the start, but then ignore after getting the money. Thinking about them as a step for submitting proposals could be a problem. Leaving them to the level of institutions can be problematic. Could one think of a system where the plan is discussed before and after submission, and the funder plays a role of coordinating data management at a super-institutional level? It’s mentioned that researchers are now being asked to review data management plans, but often don’t feel qualified to make these peer review decisions. It’s mentioned that in the UK the humanities council has a special committee for reviewing technically heavy applications.

A question is raised about licensing of facts; it’s pointed out that you can’t license facts.

All of the UK funders have said that it is acceptable to ask for funding for data provisions, but you can’t ask for a blank cheque, you have to justify the request in the grant. This automatically interacts with the institutional level, as you will eventually end up interacting with whatever resources are available at your institutions.

It’s mentioned that it’s a mistake to consider infrastructure as only hardware. Its definition needs to expand to include skills.

The issue of rewards and incentives is mentioned. Bernd mentions that making data available can help with discoverability. Making data required at point of publication is mentioned as a mechanism (but to be honest the researchers do not seem convinced).

We ask the researchers what incentives they need to see to become more open to the idea of sharing, and we get a variety of answers:

- seeing that people who share are more successful
- knowing that a shared body of knowledge can provide more power in terms of making scientific advances (when I can see more data through the act of sharing my data)
- already has benefitted from sharing, got 80% good experience, 20% bad experience, but is mostly concerned that if he has to make all of his data available it will be too much of a burden, will take time away from doing science.

(transmitting data is mentioned again, I wonder about sending computation to the data, rather than the other way around).

A comment is made that big data sets can be expensive to store, up to 30K for two years of storage. This can freeze out younger researchers. (Jo mentions again that we have places to put some data, but our systems do not cover all data types at this point in time.)

We devolve into writing a PowerPoint slide via committee.

### Summary of issues from breakout session.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Q: how do we obtain informed consent?

Q: how do we combine heterogeneous data from within one discipline or study

Q: where do we put our data?

Q: who owns the data/a bacterium?

Problem: data is useless without the computational infrastructure behind the data

Q/Problem how do you give credit for software

Problem: access and description of materials is often poor

### Summary of solutions from breakout session.

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short term contracts by researchers

Solution: share data and operational algorithms (data cubes)

Solution: make data sharing mandatory, like in [ICOS](http://www.icos-infrastructure.eu) & [NEON](http://www.neoninc.org/science/data)

Solution: Docker or Vagrant

## Summary of issues from the workshop.

Point: hiring domain experts, e.g. DBAs, on a temporary basis is hard

Q: how do we make data donation habitual?

Point: manual labour is required, the current credit system does not support that.

Point: formats need to be updated, need to work in the long term, and need intelligent curation; structures to support this do not exist.

Q: how might we create structures and systems to support data curation, and intelligent curation.

Point: open data has been defined by rich labs, it's ambiguous, and currently non-inclusive

Q: how can we get to an agreed understanding of what Open Data is, and what currency it has in research communication

Point: who pays, what do they pay for?

Q: how might we define data, per discipline

Q: how do you reward data dissemination? how do you provide incentives

Q: how do we deal with data fraud

Q: how might we provide a proper citation and reward system

Q: how might we provide more training

Q: how might we provide better infrastructure for data

Q: how do we combine heterogeneous data from within one discipline or study

Q: where do we put our data?

Q: who owns the data/a bacterium?

Problem: data is useless without the computational infrastructure behind the data

## Summary of solutions proposed by the workshop.

Solution: have a bank of data domain experts ready in the library/institution that can be seconded out or hired on short term contracts by researchers

Solution: share data and operational algorithms (data cubes)

## Hallway conversations.

I chatted briefly with an engineer over coffee. She described some of the data that they deal with when looking at modelling the potential effects of building tunnels under a city, and that effect on the buildings on the ground.

## Bingo card terms

Lots of topics come up again and again, so I’ve quickly created a small set of data sharing bingo cards. I’ve used the following terms:

- who pays
- relation to open access?
- royal society report
- privacy concerns
- the humanities are different
- data standards
- embargos
- how do we cite data
- data quality
- big data
- irreproducible science
- legal restrictions
- licensing
- I’ll get scooped
- no time to share
- my data will be misunderstood
- there is no infrastructure
- my data is sensitive
- bottom up
- top down
- sustainability
- PLOS
- discoverability
- incentives
- data citation
- publishable unit
- supercomputer
- anonymise the data

This work is licensed under a Creative Commons Attribution 4.0 International License