Thu Mar 16, 2017
This week I attended futurepub10, I love these events, I’ve been to a bunch, and the format of short talks, and lots of time to catchup with people is just great.
# A new Cartography of Collaboration - Daniel Hook, CEO Digital Science (work with Ian Calvert).
Digital science have produced a report on collaboration, and this talk was covering one of chapters from that.
I was interested to see what the key takeaways are that you can describe in a five minute talk. This talk looked at what could be inferred around collaboration by looking at co-authors actually using the Overleaf writing tool. It’s clear that there is an increasing amount of information available, and it’s also clear that if you have a collaborative authoring tool you are going to get information that was not previously available by just looking at the publication record.
Daniel confirmed they can look at the likely journals for submission, based on the article templates, how much effort in time and content that each author is providing to the collaboration, how long it takes to go from initial draft to completed manuscript, which manuscripts end up not being completed. There is a real treasure trove of information here. (I wonder if you can call the documents that don’t get completed the
dark collaboration graph).
In addition to these pieces of metadata there are the more standard ones, institute, country, subject matter.
In spite of all of the interesting real-time and fine grained data that they have, for the first pass info they looked at the country - country relations. A quick eyeballing shows that the US does not collaborate across country boundaries as much as the EU does. The US is highly collaborative within the US.
Looking at the country to country collaboration stats for countries in the EU I’d love to see what that looks like scaled per researcher rather than weighted by researchers per country, are there any countries that are punching above their weight per capita?
In the US when you look at the State to State relations California represents a superstate in terms of collaboration. South Carolina is the least collaborative!!
The measures of centrality in the report is based on document numbers related to collaborations.
The data that generates the report is updated in real time, but it seems like they don’t track it in real time yet. (It seems to me that this would really come down to a cost benefit analysis, until you have a key set of things that you want to know about this data you probably don’t need to look at real time updates.). Daniel mentions that they might be able to begin to look at the characteristic time scale to complete a collaboration within different disciplines.
In terms of surprise there was the expectation in the US that collaboration would be more regional than they saw (my guess is that a lot of the national level collaboration is determined by centres of excellence for different research areas, a lot driven by Ivy League).
Someone asks if these maps can be broken out by subject area. It seems that it’s probable that they can get this data, but the fields will be biased around the core fields that are using by Overleaf.
This leads to an interesting question, how many users within a discipline do you need to get to get representative coverage for a field (when I was at Mendeley I recall we were excited to find that the number might be in the single digit percentages, but I can’t recall if that still holds any more, nor why it might.).
Someone asks about the collaboration quality of individual authors. Daniel mentions that this is a tricky question, owing to user privacy. They were clear that they had to create a report the didn’t expose any personally identifiable information.
I think that they are sitting on a really interesting source of information, and for any organisation to have information at this level, especially with the promise of real time updates, that’s quite exciting, however I’m not convinced that there is much extra information here than you would get by just looking at the collaboration graphs based on the published literature. This is what I’d love to see, can you evidence that the information you get from looking at real time authoring is substantively different than what you would get by mining the open literature? Doing this kind of real time analysis is probably only going to happen if Overleaf see a direct need to understand their user base in that way, and doing that is always going to need to be traded off against other development opportunities. Perhaps if they can find a way to cleanly anonymise some of this info, they could put it into the public domain and allow other researchers to have a shot at finding interesting trends?
The other papers in the report also look interesting and I’m looking forward to reading through them. The network visualisations are stunning and I’m guessing that they used gephi to derive them.
Peerage of science provides a platform to allow researchers to get feedback on their manuscripts from others (reviewing) before submission, and allows them to get feedback on how useful their reviews are to others. A number of journals participate to allow easy submission of a manuscript along with review for consideration for publication.
Janne is emphasising that the quality of the peer review that is generated in his system is high. These reviews are also peer evaluated, on a section by section base.
Reviewers need to provide feedback to each other. This is a new element to the system, and according to Janne the introduction of this new section in their system has not negatively affected the time to complete the review by any significant factor.
75% of manuscripts submitted to their system end up eventually published. 32% are published directly in the journals that are part of the system. 27% are exported to non-participating journals.
The reason why people take part in reviewing is that they can get a profile on how good their reviews are from their colleagues, building up their reviewing profile.
Is there any evidence that the reviews actually improve the paper? The process always involves revisions on the paper, but there is no suggestion that there is direct evidence that this improves the paper.
Really, anything that helps to improve the nature of peer review has to be welcomed. I remember when this service first launched, and I was skeptical back then, but they are still going, and that’s great. In the talk I didn’t catch how much volume they are processing. I’m keen to see many experiments like this one come to fruition.
Discover what’s been missing, Vicky Hampshire, Yenow
Yenow uses machine learning to extract concepts from a corpus, and then provides a nifty interface to show people the correlation between concepts. These correlations are presented as a concept graph, and the suggestion is that this is a nice way to explore a space. Specific snippets of content are returned to the searcher, so this can be used as a literature review tool.
I had the pleasure of spending an hour last week at their headquarters in Redwood California having a look at the system in detail, and I’ll throw in some general thoughts at the bottom of this section. It was nice to see it all presented in a five minute pitch too. They do no human curating of the content.
They incorporated in 2014, is now based in California, but the technology was created in Kings in London. As I understand it the core technology was originally used in the drug discovery realm and one of their early advisors Mike Keller had a role in alerting them to the potential for this technology in the academic search space.
The service is available through institutional subscription and it’s been deployed at a number of institutions such as Berkeley, Stanford and the state library of Bavaria (where you can try it out for yourself.)
To date they have indexed 100M items of text and they have extracted about 30M concepts.
Are they looking at institutions and authors? These are things that are on their roadmap, but they have other languages higher up in their priorities. They system won’t do translation, but they are looking for cross-language concept identification. They are interested in using the technology to identify images and videos.
They do capture search queries, and they have a real time dashboard for their customers to see what searchers are being made. They also make this available for publishing partners. This information is not yet available to researchers who are searching.
They are also working on auto-tagging content with concepts, and there is a product in development for publishers to help them auto-categorise their corpus.
They are asked what graph database they are using. They are using DynamoDB and elasticsearch, but Vicky mentioned that the underlying infrastructure is mostly off the shelf, and the key things are the algorithms that they are applying.
At the moment there is no API, the interface is only available to subscribing institutions. The publisher system that they are developing is planned to have an API.
There is a lot to unpack here. The scholarly kitchen recently had a nice overview of services that are assembling all of the scholarly content, and I think there is something here of great importance for the future of the industry, but what that is is not totally clear to me yet.
I’m aware of conversations that have been going on for some years now about wanting to see the proof of the value of open access through the development of great tools on top of open content, and as we get more and more open access content the collection of all of that content into one location for further analysis should become easier and easier, however yenow, along with other services like meta and google scholar, have been building out by working on access agreements with publishers. It’s clear that the creation of tools built on top of everything is not dependent on all of the content being open, it’s dependent on the service you are providing being not perceived as threatening to the business model of publishers.
That puts limits on the nature of the services that we can construct from this strategy of content partnerships. It’s also the case that for every organisation that wants to try to create a service like this, they have to go through the process of setting up agreements individually, and this probably creates a barrier to innovation.
Up until now many of the kinds of services that have been built in this way have been discovery or search services, and I think publishers are quite comfortable with that approach, but as we start to integrate machine learning, and increase the sophistication of what can be accomplished on top of the literature, will that have the potential to erode the perceived value of publisher as a destination? Will that be a driver to accelerate the unbundling of the services that publishers provide. In the current world I may use an intermediate search service to find the content that may interest me, and then engage with that content at the publisher site. In a near future world if I create a natural language interface into the concept map, perhaps I’ll just ask the search engine for my answer directly. Indeed I may ask the search engine to tell me what I ought to be asking for. Owing to the fact that I don’t have full overview of the literature I’m not in a position to know what to ask for myself, so I’ll rely on being told. In those scenarios we continue to disrupt the already tenuous relationship between reader and publisher.
There are some other interesting things to think about too. How many different AI representations of the literature should be hope for? Would one be just too black boxed to be reliable? How may we determine reproducibility of search results? how can we ensure representation of correlations that are not just defined by the implicit biases of the algorithm? should we give the reader algorithmic choice? Should there be algorithmic accountability? Will query results be dependent on the order in which the AI reads the literature? Many many many interesting questions.
The move to do this without any human curation is a bold one. Other people in this space hold the opinion that this approach currently has natural limits, but it’s clear that the Yenow folk don’t see it that way. I don’t know how to test that, but maybe as searches on the platform become more focussed, that’s the moment where those differences could come to light.
I do have some comments on the product itself. I spent a little time today using the demo site available from the state library of Bavaria. It strikes me that I would quite like to be able to choose my own relevance criteria so that I can have a more exploratory relationship with the results. I did find a few interesting connections through querying against some topics that I was recently interested in, but I had the itch to want to be able to peel back the algorithm to try to understand how the concepts were generated. It’s possible that this kind of search angst was something that I experience years ago with keyword search, but that years of practice have beaten the inquisitiveness out of me, but for now that is definitely something that I noticed while using this concept map, almost a desire to know what lies in the spaces between the connections.
At the moment they are looking to sell a subscription into libraries. It’s almost certain that this won’t totally replace current search interfaces (that sentence might come back to haunt me!). The challenge they face in this space is that they are Yet Another Discovery Interface, and people using these tools probably don’t invest a huge amount of time learning their intricacies. On the other hand the subscription model can be monetized immediately, and you don’t have to compete with Google head to head.
On a minor note looking at their interface there is an option to sign in, but It’s not clear to me why I should. I imagine that it might save my searches, that it might provide the opportunity for me to subscribe to some kind of updating service, but I just can’t tell from the sign up page.
CrossRef Event Data - Joe Wass - @JoeWass
By this stage in the evening the heat was rising in the room, and the jet lag was beginning to kick in, so my notes start to thin out a lot. Joe presented some updates on the CrossRef event data service. It was great to see it live, and I’d love to see it being incorporated into things like altmetric. Perhaps they need a bounty for encouraging people to build some apps on top of this data store?
At the moment they are generating about 10k events per day. They have about 0.5M events in total.
They provide the data as CC0, and for every event in the data store they give a full audit trail
Musicians and Scientists - Eva Amson - @easternblot
Eva gave a beautiful little talk about the relationship between scientists and musicians, and that there are a disproportionally high number of scientists who play instruments than in the general population. She has been collecting stories for a number of years now and the overlap between these two activities is striking. You can read more about the project on her site and you can catch Eva playing with http://www.londoneuphonia.com on Saturday at St Paul’s Church Knightsbridge.