This week I attended futurepub10. I love these events; I’ve been to a bunch, and the format of short talks with lots of time to catch up with people is just great.

# A new Cartography of Collaboration - Daniel Hook, CEO Digital Science (work with Ian Calvert).

Digital Science have produced a report on collaboration, and this talk covered one of the chapters from it.

I was interested to see what the key takeaways are that you can describe in a five minute talk. This talk looked at what could be inferred around collaboration by looking at co-authors actually using the Overleaf writing tool. It’s clear that there is an increasing amount of information available, and it’s also clear that if you have a collaborative authoring tool you are going to get information that was not previously available by just looking at the publication record.

Daniel confirmed they can look at the likely journals for submission (based on the article templates), how much effort in time and content each author is contributing to the collaboration, how long it takes to go from initial draft to completed manuscript, and which manuscripts end up not being completed. There is a real treasure trove of information here. (I wonder if you can call the documents that don’t get completed the dark collaboration graph.)

In addition to these pieces of metadata there are the more standard ones, institute, country, subject matter.

In spite of all of the interesting real-time and fine grained data that they have, for the first pass info they looked at the country - country relations. A quick eyeballing shows that the US does not collaborate across country boundaries as much as the EU does. The US is highly collaborative within the US.

Looking at the country to country collaboration stats for countries in the EU, I’d love to see what that looks like scaled per researcher rather than weighted by researchers per country. Are there any countries that are punching above their weight per capita?

In the US when you look at the State to State relations California represents a superstate in terms of collaboration. South Carolina is the least collaborative!!

The measures of centrality in the report are based on document numbers related to collaborations.
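To make the per-researcher question concrete, here is a minimal sketch of how country to country links could be counted from documents and then rescaled. The talk did not spell out the report’s centrality measure beyond saying it is based on document numbers, so weighted degree is used here as a stand-in, and both the documents and the researcher counts are invented toy values, not data from the report.

```python
from collections import Counter
from itertools import combinations

# Toy data, invented for illustration only: each document is the set of
# countries its co-authors are based in.
documents = [
    {"US", "UK"},
    {"US", "DE"},
    {"UK", "DE", "IE"},
    {"IE", "UK"},
    {"US"},  # single-country document: contributes no cross-country link
]

# Hypothetical researcher counts per country (not real figures).
researchers = {"US": 1000, "UK": 400, "DE": 500, "IE": 50}


def pair_weights(docs):
    """Count the documents shared by each unordered pair of countries."""
    weights = Counter()
    for countries in docs:
        for a, b in combinations(sorted(countries), 2):
            weights[(a, b)] += 1
    return weights


def strength(weights):
    """Weighted degree: sum of the pair weights touching each country."""
    totals = Counter()
    for (a, b), w in weights.items():
        totals[a] += w
        totals[b] += w
    return totals


weights = pair_weights(documents)
totals = strength(weights)

# Rescale by the number of researchers to get a per capita view.
per_researcher = {c: totals[c] / researchers[c] for c in researchers}

for country in sorted(per_researcher, key=per_researcher.get, reverse=True):
    print(country, totals[country], round(per_researcher[country], 4))
```

In this toy set the smallest country comes out on top once the totals are divided by researcher count, which is exactly the kind of punching above their weight effect that a raw document count would hide.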

Question Time!

The data that generates the report is updated in real time, but it seems like they don’t track it in real time yet. (It seems to me that this would really come down to a cost benefit analysis: until you have a key set of things that you want to know about this data, you probably don’t need to look at real time updates.) Daniel mentions that they might be able to begin to look at the characteristic time scale to complete a collaboration within different disciplines.

In terms of surprise there was the expectation in the US that collaboration would be more regional than they saw (my guess is that a lot of the national level collaboration is determined by centres of excellence for different research areas, a lot driven by Ivy League).

Someone asks if these maps can be broken out by subject area. It seems probable that they can get this data, but the fields will be biased towards the core fields that are using Overleaf.

This leads to an interesting question: how many users within a discipline do you need to get representative coverage for a field? (When I was at Mendeley I recall we were excited to find that the number might be in the single digit percentages, but I can’t recall if that still holds, nor why it might.)

Someone asks about the collaboration quality of individual authors. Daniel mentions that this is a tricky question, owing to user privacy. They were clear that they had to create a report that didn’t expose any personally identifiable information.

### Comment

I think that they are sitting on a really interesting source of information, and for any organisation to have information at this level, especially with the promise of real time updates, that’s quite exciting, however I’m not convinced that there is much extra information here than you would get by just looking at the collaboration graphs based on the published literature. This is what I’d love to see, can you evidence that the information you get from looking at real time authoring is substantively different than what you would get by mining the open literature? Doing this kind of real time analysis is probably only going to happen if Overleaf see a direct need to understand their user base in that way, and doing that is always going to need to be traded off against other development opportunities. Perhaps if they can find a way to cleanly anonymise some of this info, they could put it into the public domain and allow other researchers to have a shot at finding interesting trends?

The other papers in the report also look interesting and I’m looking forward to reading through them. The network visualisations are stunning and I’m guessing that they used gephi to derive them.

# Open Engagement and Quality Incentives in Peer Review, Janne Tuomas Seppänen, founder of Peerage of Science. @JanneSeppanen

Peerage of science provides a platform to allow researchers to get feedback on their manuscripts from others (reviewing) before submission, and allows them to get feedback on how useful their reviews are to others. A number of journals participate to allow easy submission of a manuscript along with review for consideration for publication.

Janne is emphasising that the quality of the peer review that is generated in his system is high. These reviews are also peer evaluated, on a section by section basis.

Reviewers need to provide feedback to each other. This is a new element to the system, and according to Janne the introduction of this new section in their system has not negatively affected the time to complete the review by any significant factor.

75% of manuscripts submitted to their system end up eventually published. 32% are published directly in the journals that are part of the system. 27% are exported to non-participating journals.

### Questions

The reason why people take part in reviewing is that they can get a profile on how good their reviews are from their colleagues, building up their reviewing profile.

Is there any evidence that the reviews actually improve the paper? The process always involves revisions on the paper, but there is no suggestion that there is direct evidence that this improves the paper.

### Comment

Really, anything that helps to improve the nature of peer review has to be welcomed. I remember when this service first launched, and I was skeptical back then, but they are still going, and that’s great. In the talk I didn’t catch how much volume they are processing. I’m keen to see many experiments like this one come to fruition.

# Discover what’s been missing, Vicky Hampshire, Yenow

Yenow uses machine learning to extract concepts from a corpus, and then provides a nifty interface to show people the correlation between concepts. These correlations are presented as a concept graph, and the suggestion is that this is a nice way to explore a space. Specific snippets of content are returned to the searcher, so this can be used as a literature review tool.

I had the pleasure of spending an hour last week at their headquarters in Redwood California having a look at the system in detail, and I’ll throw in some general thoughts at the bottom of this section. It was nice to see it all presented in a five minute pitch too. They do no human curating of the content.

They incorporated in 2014 and are now based in California, but the technology was created at Kings in London. As I understand it, the core technology was originally used in the drug discovery realm, and one of their early advisors, Mike Keller, had a role in alerting them to the potential for this technology in the academic search space.

The service is available through institutional subscription and it’s been deployed at a number of institutions such as Berkeley, Stanford and the state library of Bavaria (where you can try it out for yourself.)

To date they have indexed 100M items of text and they have extracted about 30M concepts.

### Questions

Are they looking at institutions and authors? These are things that are on their roadmap, but they have other languages higher up in their priorities. Their system won’t do translation, but they are looking at cross-language concept identification. They are interested in using the technology to identify images and videos.

They do capture search queries, and they have a real time dashboard for their customers to see what searches are being made. They also make this available for publishing partners. This information is not yet available to researchers who are searching.

They are also working on auto-tagging content with concepts, and there is a product in development for publishers to help them auto-categorise their corpus.

They are asked what graph database they are using. They are using DynamoDB and elasticsearch, but Vicky mentioned that the underlying infrastructure is mostly off the shelf, and the key things are the algorithms that they are applying.

At the moment there is no API, the interface is only available to subscribing institutions. The publisher system that they are developing is planned to have an API.

### Comment

There is a lot to unpack here. The scholarly kitchen recently had a nice overview of services that are assembling all of the scholarly content, and I think there is something here of great importance for the future of the industry, but what that is is not totally clear to me yet.

I’m aware of conversations that have been going on for some years now about wanting to see the proof of the value of open access through the development of great tools on top of open content, and as we get more and more open access content the collection of all of that content into one location for further analysis should become easier and easier, however yenow, along with other services like meta and google scholar, have been building out by working on access agreements with publishers. It’s clear that the creation of tools built on top of everything is not dependent on all of the content being open, it’s dependent on the service you are providing being not perceived as threatening to the business model of publishers.

That puts limits on the nature of the services that we can construct from this strategy of content partnerships. It’s also the case that for every organisation that wants to try to create a service like this, they have to go through the process of setting up agreements individually, and this probably creates a barrier to innovation.

Up until now many of the kinds of services that have been built in this way have been discovery or search services, and I think publishers are quite comfortable with that approach, but as we start to integrate machine learning, and increase the sophistication of what can be accomplished on top of the literature, will that have the potential to erode the perceived value of the publisher as a destination? Will that be a driver to accelerate the unbundling of the services that publishers provide? In the current world I may use an intermediate search service to find the content that may interest me, and then engage with that content at the publisher site. In a near future world if I create a natural language interface into the concept map, perhaps I’ll just ask the search engine for my answer directly. Indeed I may ask the search engine to tell me what I ought to be asking for. Owing to the fact that I don’t have a full overview of the literature I’m not in a position to know what to ask for myself, so I’ll rely on being told. In those scenarios we continue to disrupt the already tenuous relationship between reader and publisher.

There are some other interesting things to think about too. How many different AI representations of the literature should we hope for? Would one be just too black boxed to be reliable? How may we determine reproducibility of search results? How can we ensure representation of correlations that are not just defined by the implicit biases of the algorithm? Should we give the reader algorithmic choice? Should there be algorithmic accountability? Will query results be dependent on the order in which the AI reads the literature? Many many many interesting questions.

The move to do this without any human curation is a bold one. Other people in this space hold the opinion that this approach currently has natural limits, but it’s clear that the Yenow folk don’t see it that way. I don’t know how to test that, but maybe as searches on the platform become more focussed, that’s the moment where those differences could come to light.

I do have some comments on the product itself. I spent a little time today using the demo site available from the state library of Bavaria. It strikes me that I would quite like to be able to choose my own relevance criteria so that I can have a more exploratory relationship with the results. I did find a few interesting connections through querying against some topics that I was recently interested in, but I had the itch to want to be able to peel back the algorithm to try to understand how the concepts were generated. It’s possible that this kind of search angst was something that I experienced years ago with keyword search, and that years of practice have beaten the inquisitiveness out of me, but for now that is definitely something that I noticed while using this concept map: almost a desire to know what lies in the spaces between the connections.

At the moment they are looking to sell a subscription into libraries. It’s almost certain that this won’t totally replace current search interfaces (that sentence might come back to haunt me!). The challenge they face in this space is that they are Yet Another Discovery Interface, and people using these tools probably don’t invest a huge amount of time learning their intricacies. On the other hand the subscription model can be monetized immediately, and you don’t have to compete with Google head to head.

On a minor note, looking at their interface there is an option to sign in, but it’s not clear to me why I should. I imagine that it might save my searches, or that it might provide the opportunity for me to subscribe to some kind of updating service, but I just can’t tell from the sign up page.

# CrossRef Event Data - Joe Wass - @JoeWass

By this stage in the evening the heat was rising in the room, and the jet lag was beginning to kick in, so my notes start to thin out a lot. Joe presented some updates on the CrossRef event data service. It was great to see it live, and I’d love to see it being incorporated into things like altmetric. Perhaps they need a bounty for encouraging people to build some apps on top of this data store?

At the moment they are generating about 10k events per day. They have about 0.5M events in total.

They provide the data as CC0, and for every event in the data store they give a full audit trail.

# Musicians and Scientists - Eva Amson - @easternblot

Eva gave a beautiful little talk about the relationship between scientists and musicians, and how a disproportionately high number of scientists play instruments compared with the general population. She has been collecting stories for a number of years now and the overlap between these two activities is striking. You can read more about the project on her site and you can catch Eva playing with http://www.londoneuphonia.com on Saturday at St Paul’s Church Knightsbridge.

Three posts about product development

lean value tree

I’m catching up on some reading at the moment. Trying to make headway on some other work while jet lagged is proving a challenge. Anyway, here are a couple of nice posts about product development that popped up in my feed (hat tip to Mind the Product Weekly Newsletter).

## What do people do in the spaces in between?

When thinking about what people do with your product, also think about what they don’t do, and how to help them get to where they are going.

The takeaway from this post is that by mapping out these interstitial moments you can get to a better understanding of your users’ needs, and better map the requirements of what you need to build.

## We have been getting MVP wrong all this time, the point is to validate, not to delight for its own sake.

Forget “MVP”, focus on testing your biggest assumptions

The key point in this post is that when deciding what to ship, use each iteration as an opportunity to test your riskiest assumptions, and understand what you expect to learn with each release. If you don’t know what those assumptions are, or what you are going to learn, why are you shipping a feature? I imagine that this post is mostly directed towards products that are still exploring the market-fit space, however even established products live within spaces that are evolving so some of this thinking carries over too.

It reminds me of the Popperian view that you can’t prove a hypothesis, but you can reject one, so to be most valuable each experiment should be constructed to try to reject the most critical hypothesis.

I think there is at least one counter argument to the main point in this post, but you know things are complex, so that’s OK. If you are in a space where you understand your users well, and you have considerable experience to hand, it is probably OK to just do what you know to be right in terms of benefitting the user.

## Burn the roadmaps!!

Throw out the product roadmap, usher in the validation roadmap!

This post was very welcome reading for me as I have a terrible relationship with product roadmaps. I just think that in a fast moving environment you don’t know what you are going to be doing in 12 months, and god forbid if you are tied down already to what you are going to be doing in 18 months, because then you are probably not exploring a new space. Of course when you get to scale, and when you get to work on projects at scale, those kinds of timelines do in fact make sense, but I still like the idea of flipping the roadmap into one that is constructed around confirming/testing our understanding of the world, in contrast to constructing how we want to roll out our features.

## Lean value tree, and constant experimentation

The image at the top of this post is a representation of a tool called the lean value tree (see slide 30 from this deck). We have been using it a bit in the last two months at my current role, and I’m finding a lot of value in it. One of the things that ties all three of the posts that I have linked here together is the idea of experimentation. Understand your missing assumptions, test rigorously, and be led in decision making by what you can learn. Something like the lean value tree can sit above these imperatives and help you make decisions around which experiments to spin up, and how to balance opportunities. Having worked it pretty hard in the past few weeks I can see that it has a lot of value, but it still does not beat open conversation in an open team.

PLOS are looking for a new CEO

So I hear that PLOS are looking for a new CEO. They are making the process fairly open, so if you are interested you can read more here.

I got to thinking about some of the challenges and opportunities facing PLOS over the weekend. Over the years I’ve gotten to know a lot of PLOS folk, and I think it’s an amazing organisation. It has proved the viability of open access, and their business model is being copied by a lot of other publishers. At the same time they have had a fairly high frequency of turn over of senior staff in the last couple of years. So what are the likely challenges that a new CEO will face, and what should they do about them? (Time for some armchair CEO’ing).

The condensed view of PLOS’s mission is that they want to accelerate progress in science and medicine. At the heart of their mission is the belief that knowledge is a public good, and leading on from that, that the means for transmitting that knowledge (specifically research papers) should also be a public good.

It was founded in 2001 by three visionaries, and it was configured to be a transformational organisation that could catalyse radical change in the way that knowledge is created and disseminated, initially in particular in contrast to the subscription model for distributing scholarly content.

Since launching PLOS has found massive success with the introduction of PLOS one, currently the largest journal in the world. That rapid growth led to a period of significant scaling and adjustment for the organisation, where it had to keep running at full pace in order to stay just about on top of the flood of manuscripts that were coming its way. This also created a big revenue driver for the organisation that has led to PLOS one being the engine that drives the rest of the PLOS.

So now we have the strategic crux facing any incoming CEO. The organisation has an obligation to be radical in its approach to furthering its mission, but at the same time the engine that drives the organisation operates at such scale that changes to the way it works introduce systemic risks to the whole organisation. You also have to factor in that the basic business model of PLOS one is not defensible, and market share is being eroded by new entrants, in particular Nature Communications, so it is likely that making no changes also represents a risky strategy.

So what to do?

There are probably many routes to take, and there are certainly a large number of ongoing activities that PLOS is engaged in as part of the natural practice of any organisation. I think the following perspectives might have some bearing on where to go. As with any advice, it’s much easier to throw ideas across the wall when you don’t have any responsibility for them, but I’m going to do it anyway in the full awareness that much of what I say below might not actually be useful at all.

Changing PLOS does not change scientists

PLOS has shown that Open Access can succeed, and its existence has been critical in confirming the desire of researchers who want research conducted as an open enterprise. That has allowed those researchers to advocate for something real, rather than something imagined. However, there remain a large number of researchers for whom the constraints of the rewards system they operate under outweigh any interest they may have in open science. I think it is important to recognise that no matter what changes PLOS introduces, those changes on their own will not be sufficient to change the behaviour of all (or even of a majority) of researchers. Being able to show plausible alternatives to the existing system is important, but it is also important to continue to work closely with other key actors in the ecosystem to try to advance systemic change. What that tells me is that the bets that PLOS ought to take on to create change do have to be weighed against their likelihood to affect all researchers, and the risks they introduce to the current business model of PLOS.

On the other hand you do want to progressively make it possible for people to be more open in how they conduct science. We talked a lot at eLife about supporting good behaviours, and you could imagine using pricing or speed mechanisms as a way of driving that change (e.g. lower costs for publishing articles that have been placed on a preprint server). One does have to be careful with pricing in academic circles, as the cost of publication is rarely a factor in an academic’s decision about where to publish, but generally I’m in favour of providing potentially different routes through a product to different users, and making the routes that promote the behaviours I support easier/cheaper. (Github do this brilliantly by making open code repositories free to host, and only making you pay if you want to keep your code private.)

How do you balance risk?

One of the things that is consistent in innovation is that we mostly don’t know what is going to succeed. I expect that the success of PLOS one probably took PLOS by surprise. It was a small change to an existing process, but it had a dramatic effect on the organisation.

It seems to me that what you want to do is to have a fair number of bets in play. If we accept that we mostly won’t know what is going to succeed in the first place, then the key thing is to have a sufficient number of bets in place that you get coverage over the landscape of possibilities, and you iterate and iterate and iterate on the ones that start working well, and you have the resolve to close down the ones that are either making no progress or are getting stuck in local minima.

Product Horizons

I like the idea of creating a portfolio of product ideas around the three horizons principle. There are lots of ways of determining if your bets are paying off. One of the things that I think PLOS needs to do is to ensure that at least a certain minimum of its financial base is being directed towards this level of innovation.

I don’t think that is a problem at all for the organisation in terms of creating tools like ALM and their new submissions and peer review system, but I’m not clear on whether they have been doing this strategically across all of the bases where they want to have an impact. That’s not an easy thing to do: balancing ongoing work and new ideas, being disciplined enough to move on, and being disciplined enough to keep going with the realisation that real success sometimes takes you by surprise.

PLOS may need diversification

As I referred to above, the business model of PLOS, as it’s currently configured, is not easily defensible. Many other publishers have created open access journals with publishing criteria based on solidity of the science rather than impact. The Nature branded version of this is now attracting a huge number of papers (one imagines driven by the brand overflow from the main Nature titles). This speaks to me that there is some value in looking at diversifying the revenue streams that PLOS generates. This could be around further services to authors, to funders or to other actors in the current scholarly ecosystem. Here are three ways to potentially look at the market.

One; what will the future flow of research papers look like, and how does one capture an increasing share of that? Will increased efficiencies of time to publication, and improved services around the manuscript, be sufficient? How might the peer review system be modified to make authors happier?

Two; ask how will funding flow to support data and code publishing, will there be funding for creating new systems for assessment? Can any services that benefit PLOS be extended to benefit others in the same way?

Three; if you are creating platforms and systems that can be flexible and support the existing scale of PLOS, what might the marginal investment be to extend those platforms so that others could use them (societies, small groups of academics that want to self-publish, national bodies or organisations from emerging research markets)?

The key here is not to suggest that PLOS has to change for its own sake, but rather to be clear about exploring these kinds of options strategically. It might be that you can create streams of revenue that make innovation self-supporting, or it might be that you hit on a way to upend the APC model. These efforts could be seen as investment in case the existing driver of revenue continues to come under increasing pressure in the future.

Ultimately you want to build a sustainable engine for innovation.

Who does all of the work?

In the end all of the work is done by real people, and the key thing any new CEO is going to have to do is to bring a clarity of purpose, and to support the staff who are in the thick of things. What I’ve seen cause the most dissatisfaction in staff (aside from micromanagement - a plague on the houses of all micro-managers) is a lack of ability to ship. This usually comes down to one of two causes: either priorities change too quickly, or unrealistic deadlines are set that lead to the introduction of technical debt, which causes delays in shipping. It’s key to try to identify bottlenecks in the organisation, and (as contradictory as it might sound) to try to create slack in people’s schedules to allow for true creative work to happen.

If everyone is going open access why should PLOS exist, has it now succeeded in some way?

Given that almost all new journal launches are now open access journal launches, has PLOS effectively won? Could PLOS as it exists today essentially go away? I think within one area of how we get to an open research ecosystem that might actually be true, however that only speaks to access to the published literature. Open science requires so much more than that. It needs transparency around review, efficiency in getting results into the hands of those who need them, data and code that are actionable and reusable, a funding system that abandons its search for the chimera of impact, and an authoring system that is immediately interoperable with how we read on the web today.

So what to do with PLOS as it’s currently configured? I see the current PLOS, with its success, as being an opportunity to generate the revenues to continue to explore and innovate in these other areas, but I think that the current system should be protected to ensure that this is possible.

In the end of the day, what does a CEO do?

I can’t remember where I read it now, but one post from a few years back struck me as quite insightful. It said that a CEO has three jobs:

  • make sure the lights stay on
  • set the vision for the organisation
  • ensure that the best people are being hired, and supported

PLOS is in a great position at the moment. It has a business model that is working right now, and is operating at a scale that gives any incoming CEO a good bit of room to work with. It’s a truly vision led organisation, whose ultimate goal is one that can benefit all of society. It has great great people working for it.

I don’t think that the job is in any way going to be a gimme, but it’s got to be one of the most interesting challenges out there in the publishing / open science landscape at the moment.

Reverse DOI lookups with Crossref

Today I had a need to think about how to do a reverse lookup of a formatted citation to find a DOI.

@CrossrefOrg helped out and pointed me to the reverse api endpoint. It works like this:


Create a json payload file “citation.json” formatted as follows:

  "Curtis, J. R., Wenrich, M. D., Carline, J. D., Shannon, S. E., Ambrozy, D. M., & Ramsey, P. G. (2001). Understanding physicians’ skills at providing end-of-life care: Perspectives of patients, families, and health care workers. Journal of General Internal Medicine, 16, 41-49."

Call the API using CURL (you need to set the Content-Type header to application/json)

$ curl -vX POST http://api.crossref.org/reverse -d @citation.json --header "Content-Type: application/json"

I then got the following response:

{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2016,10,25]],"date-time":"2016-10-25T11:17:12Z","timestamp":1477394232160},"reference-count":21,"publisher":"Springer Nature","issue":"1","content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["J Gen Intern Med"],"cited-count":0,"published-print":{"date-parts":[[2001,1]]},"DOI":"10.1111\/j.1525-1497.2001.00333.x","type":"journal-article","created":{"date-parts":[[2004,6,9]],"date-time":"2004-06-09T16:44:02Z","timestamp":1086799442000},"page":"41-49","source":"CrossRef","title":["Understanding Physicians' Skills at Providing End-of-Life Care. Perspectives of Patients, Families, and Health Care Workers"],"prefix":"http:\/\/id.crossref.org\/prefix\/10.1007","volume":"16","author":[{"given":"J. Randall","family":"Curtis","affiliation":[]},{"given":"Marjorie D.","family":"Wenrich","affiliation":[]},{"given":"Jan D.","family":"Carline","affiliation":[]},{"given":"Sarah E.","family":"Shannon","affiliation":[]},{"given":"Donna M.","family":"Ambrozy","affiliation":[]},{"given":"Paul G.","family":"Ramsey","affiliation":[]}],"member":"http:\/\/id.crossref.org\/member\/297","container-title":["Journal of General Internal Medicine"],"original-title":[],"deposited":{"date-parts":[[2011,8,10]],"date-time":"2011-08-10T15:39:02Z","timestamp":1312990742000},"score":120.61636,"subtitle":[],"short-title":[],"issued":{"date-parts":[[2001,1]]},"alternative-id":["10.1111\/j.1525-1497.2001.00333.x"],"URL":"http:\/\/dx.doi.org\/10.1111\/j.1525-1497.2001.00333.x","ISSN":["0884-8734","1525-1497"],"citing-count":21,"subject":["Internal Medicine"]}}

From this we can see that Crossref suggests the following DOI lookup, with a score of about 120.6: http:\/\/dx.doi.org\/10.1111\/j.1525-1497.2001.00333.x

The forward slashes are backslash-escaped in the JSON, so the actual lookup URL is: http://dx.doi.org/10.1111/j.1525-1497.2001.00333.x.

This directs us to the following article, which does seem to be the one that we are interested in.
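
The backslash escaping is just JSON's optional way of writing a forward slash, so any JSON parser will undo it for you. Here is a minimal Python sketch, using a trimmed-down copy of the response above (only the fields used here are kept). The helper name `best_match` is my own invention for illustration, not part of any Crossref client.

```python
import json

# A trimmed copy of the response shown above; only the fields used below are kept.
raw_response = r'''
{"status":"ok","message-type":"work","message":{"DOI":"10.1111\/j.1525-1497.2001.00333.x","score":120.61636,"URL":"http:\/\/dx.doi.org\/10.1111\/j.1525-1497.2001.00333.x","title":["Understanding Physicians' Skills at Providing End-of-Life Care. Perspectives of Patients, Families, and Health Care Workers"]}}
'''


def best_match(response_text):
    """Return (url, score, title) from a response shaped like the one above."""
    message = json.loads(response_text)["message"]
    # json.loads turns the escaped "\/" back into a plain "/".
    return message["URL"], message["score"], message["title"][0]


url, score, title = best_match(raw_response)
print(url)
print(score)
print(title)
```

The score is only a relevance ranking from the search, so it is still worth eyeballing the returned title against the citation you started with.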

What do we mean when we talk about Big Data?

A blog post about this article offers the following definition of big data:

“High volume data that frequently combines highly structured administrative data actively collected by public sector organisations with continuously and automatically collected structured and unstructured real-time data that are often passively created by public and private entities through their internet.”

The article is behind a paywall, but the blog is pretty clearly laid out. The authors seem mostly concerned about how the term big data is used by researchers who are mostly coming from a background of working with public sector data.

My takeaways from the blog post are:

* public sector use of the term *Big Data* is sometimes divergent from what the term means in the private sector 
* real time data collection *could* be a vice in the public sector 
* Digital exhaust data is only coincidentally useful for answering public policy questions, and given that, is it at all suitable for such purposes?
* The ethics of the use of this data are unclear 
* This kind of present Big Data is not representative of our full lives, nor representative of all citizens 
* There remains great potential, but excitement around this potential must be tempered with an understanding of the current inherent limitations of this resource. 

I think these are all reasonable positions to take at the moment. However, the definition of big data leaves open how we might interpret what high volume means.

A position I’m coming to about big data is that it’s mostly about how comfortable you feel with the data, and that one person’s big data is another’s batch job. What the explosion of data has created is an increase in the number of occasions where a particular researcher will hit the limits of what is technically possible for them at that moment in time. Setting aside all of the questions about what is in the underlying data, and how well it may or may not fit the research question being asked, what I find very exciting is that the journey of gaining the capacity to work with the data that you think is big today is one which will create a cohort of researchers who are unafraid to also deal with what may be big for them tomorrow. In this way we create an environment of fantastically skilled researchers, who are potentially in a better position to tackle hard problems than they are today.

Hello SAGE!

I joined SAGE at the start of September. Hello SAGE!! Here I outline some of my initial impressions.

First up, I’ve been really delighted to meet so many great people at SAGE. I’ve received great support from everyone in the company. I generally find publishing folk to be very friendly. This is a friendly industry, working on the fabric of knowledge, knowing that your work can help to make a difference, trying to make the work of academics a bit easier. I believe that these are all things that can help to create a good environment for an industry to be situated in. All that aside, I’ve still been really impressed by how lovely everyone is. I think that comes from some initial interactions that I had way back in my first week, and it’s only continued through the weeks.

One obvious change at SAGE is the scale of the company. It’s a good bit bigger than eLife, and I’ve not worked in a company close to this size since 2010. At Mendeley, and later at eLife, I saw what happens as a company starts to grow beyond the point where everyone is able to know everything that is going on (that’s not a bad thing at all, just an inevitable part of the development of a company). Back when I was working in Springer and Nature my work mostly involved interacting with people in close proximity to my project. What I’m working on now is of interest across the company. Communicating across the natural silos of information that will emerge in a large organisation such as SAGE has required some new thinking. The main thing to note is the existence of structure that is contingent on the history of how that structure emerged, and the best thing I’ve found for understanding that quickly is just to tap into the collective wisdom that already exists within the organisation. Basically, that means asking people who have been around for a lot longer than I have about how to do things, or who to talk to. That’s mostly been successful. The one time where it didn’t work so well was when I asked someone a few things, only to discover pretty quickly that they only knew fractionally more than I did because they had only been here about a week longer than me!

I’d never known a huge amount about SAGE before starting to think seriously about coming on board. I’d known a few people for a few years, whom I held in fairly high regard. I didn’t know that the name SAGE comes from Sara and George, the founders of the company. Sara is still very much involved in the company, and chairs the board meetings, as well as continuing to take a keen interest in the strategic direction of the company. Since joining I’ve had the pleasure of meeting her a couple of times, and I’ve been hugely impressed with how impassioned she is about the important role that social research can play in society. One moment in particular stands out. It was a few weeks ago, just a few days after the US election. Moods were a little deflated. She stood up at a small meeting and simply articulated the importance of what social science researchers are doing for societal outcomes. She talked about how organisations like SAGE are in a privileged position, and being in that position almost sets a demand on them to do what they can to help support that role of social science.

I think this connects well with another thing that I’ve learnt in the last few months. For the particular project that I’m working on I’m spending a lot of time talking directly to researchers. They uniformly have a positive attitude to SAGE, and the things that SAGE builds in this space. It’s clear that the values of the company really seep into how they act in the market.

So what is it that I’m doing now? My job title here at SAGE is Head of Product Innovation. For the time being that title sits in front of one very specific project. My main responsibility over the next year is to support the emerging field of what we might loosely call computational social science. Specifically the team I am in are working on finding services that SAGE can partner on, or build. It’s a pure greenfield product development position.

Here I’m not going to get into the nitty gritty of what to call a data intensive way of doing social science (there are subtleties around whether we call it A or B, or some other label), but I’ll tell you what we currently believe, and what we have observed.

We believe that data at scale is transforming many aspects of how social science is done, and with that transformation will come the opportunity to answer questions that were previously intractable, as well as making it easier to tackle currently hard questions. We believe we are seeing the emergence of a new methodology for how social science can be done (but we also believe that this does not remove the need for existing methodologies, rather it will enhance them). My favourite analogy here is to think of this as akin to the creation of a new kind of telescope or instrument. It opens up new ways of viewing and understanding the world that build upon and broaden what we already know.

We see some groups out there already doing this kind of work, and we see many others who are interested but who face a variety of barriers to starting with these techniques. This is where the project I am working on comes in. Initially we are trying to understand these barriers, and design things that can help reduce them.

There are many reasons why this move towards data intensive social science may be important. At a very basic level expanding the tools available to scholars is always a good thing. Being able to make the most of the implicit data that is now a by-product of the digital interfaces of our lives may move us from a position where we may be haunted by that data to a position where we have the means to understand how to deal with it. I feel that most importantly it’s also about bringing some humanity to the systems that we are building today. These digital systems and the data that we as a society, and as individuals, are generating are determinative to many social outcomes. If the only driver for the creation of these systems is the market then those outcomes are probably not going to be wholly fantastic. (Cathy O’Neil writes about this clearly in Weapons of Math Destruction). To help balance this I think social scientists need a seat at the table when it comes to the design and engineering of those systems.

Over its fifty year history the publication of methods has been core to what SAGE does. From this perspective finding a way for SAGE to support the emergence of a new class of methodologies makes perfect sense for us. We are not working in isolation either, rather, we are contributing to a strong trend in a way of thinking about the systems that surround us. We want to help by being an active partner in initiatives that can help with the agenda I’ve outlined above, and where we have the opportunity to build things that help move that forward, we will try to do so.

Our thinking about the kinds of things that we can help to create is still very open ended at this point in time. It is also almost impossible to predict what are the things that you do that will have a real impact, and what are the things that you do that end up not making much difference. What is clear is that you don’t have a chance of finding out if you don’t try. We aim to try, and to learn, and hopefully we can iterate on what we learn to find a way to make a meaningful contribution.

I’ve been asked by a lot of people why I decided to move from eLife to SAGE. I’ve already outlined here a bit about the project that I’m working on. Overall when I was approached with this opportunity I decided to weigh up three factors: what impact might the project have, what impact might I have on making the project a success, and overall how would working on this help me support my family.

It was clear to me pretty quickly that this project has the potential to be impactful, and certainly that the motivations behind the instigation of the project were very aligned with my own personal beliefs and interests. I also felt that my background in working on digital tools for researchers over the last few years was a good fit for the needs of the project. This past summer my family grew by one, and now with two young children to juggle (sometimes very literally), the opportunity to work on stuff that matters, and to do so from quite close to home, was one that I had to look very closely at. I made the jump, and now three months in I can honestly say I’m just getting more and more excited by where we are going with the project.

If you have managed to get this far and you are still interested in what we are working on, then maybe you might like thinking about joining us? We currently have a very small team. We decided from the outset to pursue a lean approach to product discovery and development. The team is myself, half of the amazing Katie Metzler and some time from the also awesome Martha Sedgwick.

After a lot of learning in the last three months we have decided that we want to bring in another person full time (on a 12 month contract) to help with the development of the project. We are initially setting the position to be a 12 month fixed contract, as we just have a high degree of uncertainty around what the ideal shape of the team will be in a year from now.

Initially we want help with the following kinds of things (at the heart of which is helping us to understand the needs of researchers, and helping us to follow up on the many amazing conversations and leads that we are having right now), but we also have a fair expectation that the role, and the entire project, will continue to evolve over the coming year:

* Assist with market segmentation and market sizing
* Conduct competitor analysis and product positioning
* Recruit a pool of relevant users to test product concepts with
* Conduct solution interviews
* Participate in usability and product concept testing
* Participate in product ideation workshops
* Synthesise and capture feedback from interactions with researchers, and share and distribute that feedback amongst other team members
* Provide product development support during the build phase from concept to MVP
* Be a voice for the user through the evolution of our product ideas.

If you are interested and want to find out more please reach out to me!

What we mean when we talk about preprints

Cameron Neylon, Damian Pattinson, Geoffrey Bilder, and Jennifer Lin have just posted a cracker of a preprint onto bioRxiv.

On the origin of nonequivalent states: how we can talk about preprints

Increasingly, preprints are at the center of conversations across the research ecosystem. But disagreements remain about the role they play. Do they “count” for research assessment? Is it ok to post preprints in more than one place? In this paper, we argue that these discussions often conflate two separate issues, the history of the manuscript and the status granted it by different communities. In this paper, we propose a new model that distinguishes the characteristics of the object, its “state”, from the subjective “standing” granted to it by different communities. This provides a way to discuss the difference in practices between communities, which will deliver more productive conversations and facilitate negotiation on how to collectively improve the process of scholarly communications not only for preprints but other forms of scholarly contributions.

The opening paragraphs are a treat to read, and provide a simple illustration of a complex issue. They offer a model of state and standing that provides a clean way of talking about what we mean when we talk about preprints.

There are a couple of illustrations in the paper of how this model applies to different fields, in particular, physics, biology, and economics.

I think it would be wonderful to extend this work to look at transitions in the state/standing model within disciplines over time. I suspect that we are in the middle of a transition in biology at the moment.

Textometrica, a tool review

A quick spin with Textometrica

Leviathan Network Image

Yesterday I had a good conversation with Simon Lindgren, the creator of Textometrica. I decided to try out the tool before chatting to him.

Textometrica encapsulates a process for understanding the relationship and distribution of the occurrence of concepts in a body of plain text. It provides a multi-step online tool for the analysis.

The advantage of using this tool is that you don’t need to be able to do any coding to get to a point where you have some quite interesting analysis of your corpus. One potential downside is that the tool is strongly focussed on the specific workflow that Simon devised. When I talked to him later about this it was clear that he built the tool to scratch a specific itch.

In order to try the tool I needed a corpus to work with. I got a copy of Hobbes’s Leviathan from Project Gutenberg, and in a plain text editor I removed the Gutenberg foreword and footer.

I started by just trying to upload the file to Textometrica, and it looked like I’d made the tool hang. At this point I started looking at the 10 minute video overview of the tool, and I discovered that I needed to indicate a text block delimiter within the text. Using the editor I replaced all full stops with the pipe symbol | and re-uploaded, and made much more progress.
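
If you would rather not do the find-and-replace by hand, the same step is a one-liner in Python. This is only a sketch of what I did in the editor. The function name `add_block_delimiters` is made up for illustration, and it is as blunt as the manual approach: every full stop becomes a delimiter, including those in abbreviations.

```python
def add_block_delimiters(text, delimiter="|"):
    """Replace every full stop with the delimiter, mirroring the manual edit."""
    return text.replace(".", delimiter)


sample = "First sentence. Second sentence."
delimited = add_block_delimiters(sample)
print(delimited)  # First sentence| Second sentence|
```

For a whole file you would read the text in, pass it through the function, and write the result back out before uploading.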

If you are interested in exploring the tool I highly recommend working through the video as you get started. The tool is not exactly self-documented, but the video gives a sufficient overview of how to use it.

In under a quarter of an hour I was able to generate a network graph of the largest co-occurring concepts in the Leviathan, and was able to create a public archive of the project.
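
To give a feel for what sits behind a graph like this, here is a rough Python sketch of counting how often pairs of concepts turn up in the same text block. It is my own simplification and not Textometrica's actual algorithm. The function name, the toy sample text and the crude punctuation stripping are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(text, concepts, delimiter="|"):
    """Count how often each pair of concepts appears in the same text block."""
    wanted = {c.lower() for c in concepts}
    pairs = Counter()
    for block in text.split(delimiter):
        # Crude tokenising: split on whitespace and trim common punctuation.
        words = {w.strip(",;:()").lower() for w in block.split()}
        present = sorted(words & wanted)
        # Every unordered pair of concepts found in this block counts once.
        pairs.update(combinations(present, 2))
    return pairs


sample = "war and peace| peace and law| war, law and peace|"
counts = cooccurrence_counts(sample, ["war", "peace", "law"])
for pair, n in counts.most_common():
    print(pair, n)
```

Each pair with a non-zero count becomes an edge in the network, with the count as its weight. The tool adds steps for choosing which words and connections to keep.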

Each step of the tool has a few custom options, and it seems to me that they were introduced as a result of Simon wanting to refine the process as he developed it. This does provide the ability to do some fine tweaking of your analysis, but at the same time the options are quite opinionated, so you would want the envisaged analysis to be quite close to what you want to do with the tool.

That said, I was able to accomplish a reasonably complex analysis on a reasonably sized corpus very quickly.

Social Humanities DataHack event

How do people represent themselves on social media, and how are they represented by others? Which qualities and virtues are emphasized (or ignored)? How polarised are these (re)presentations?

There is a workshop looking at this very question happening in Oxford in early January. The morning will be a series of workshops on tools for tackling a question like the above (I’m thinking of attending the Wikipedia and Topic Modelling workshop), and the afternoon will be a hackathon looking at some data sets.

It sounds pretty interesting, and a nice way to warm up to the new year. It’s being hosted by The Oxford Research Centre in the Humanities, and you can sign up on Eventbrite.

Something broke in a Jekyll upgrade (a.k.a. sometimes I hate software)

This is a short story about software, some of the things I hate about it, my lack of knowledge of ruby, and a desire to own my own words.

For various reasons I’m working on a brand new machine, and I decided that I want to start posting to my own blog again (as well as cross posting to Medium, because fuck it, why not).

That involved dusting down my Jekyll site and seeing if I could get it to work again.

It’s been a while, mind, so the very first thing that I did was go and pull my blog content down from GitHub and fire up Jekyll.

Jekyll has moved on since I last used it, and I discovered that the mechanism that I was using to create an index page of my tags no longer works. The following line in the rake-file that I was using is deprecated, and throws an error.


I thought, it’s one line, how hard can it be to fix? Of course, the dirty little secret is that I don’t know ruby, I’d just been using Jekyll in the past as a fast way to generate blog content from markdown. I spent several hours this afternoon trying to track down a short comprehensible workaround, and have come to the conclusion that I won’t make progress without actually learning enough ruby to become proficient at writing ruby plugins, and my life is too short to do that.

Writing my markdown in Byword gives me almost instant access to publish to Medium via the publish button, but I want to control my own domain, and I want a git archive of my blog posts too, so what do I do?

I really liked the way that tagging used to work on my site, but have decided that the value add to getting it working again is too low, given the time that it might take me to work around the issue. I thought briefly about moving to a python static site generation tool, but that would involve so much work that it would defeat the purpose of doing what I want to do, which is to blog fairly efficiently.

In the end I decided to change my tagging strategy, and create some static tag templates. This post from Mike Apted was easy to follow along and get working.
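
For what it’s worth, the static stub pages could be generated by a small script rather than typed out by hand. The Python sketch below is one way that might look. It is not what the linked post does, and it makes several assumptions: posts carry a one-line `tags:` entry in their front matter, the layout is called `tagpage` (a hypothetical name), and the stubs live in a `tag` folder. The demo runs against a throwaway directory so it doesn’t touch a real site.

```python
import re
import tempfile
from pathlib import Path


def collect_tags(posts_dir):
    """Gather tags from a one-line `tags:` entry in each post's front matter."""
    tags = set()
    for post in Path(posts_dir).glob("*.md"):
        for line in post.read_text(encoding="utf-8").splitlines():
            if line.startswith("tags:"):
                # Accepts "tags: a b" and "tags: [a, b]"; nothing fancier.
                tags.update(re.findall(r"[\w-]+", line[len("tags:"):]))
                break
    return sorted(tags)


def write_tag_pages(tags, out_dir, layout="tagpage"):
    """Write one static stub page per tag; `layout` is a hypothetical name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for tag in tags:
        stub = f"---\nlayout: {layout}\ntag: {tag}\npermalink: /tag/{tag}/\n---\n"
        (out / f"{tag}.md").write_text(stub, encoding="utf-8")


# Demo against a temporary directory with a single made-up post.
with tempfile.TemporaryDirectory() as tmp:
    posts = Path(tmp, "_posts")
    posts.mkdir()
    (posts / "2016-12-01-example.md").write_text(
        "---\ntitle: Example\ntags: [jekyll, ruby]\n---\nBody text\n",
        encoding="utf-8",
    )
    found = collect_tags(posts)
    write_tag_pages(found, Path(tmp, "tag"))
    generated = sorted(p.name for p in Path(tmp, "tag").iterdir())
    print(found)
    print(generated)
```

The trade-off is the same one this post is about: the script has to be re-run whenever a new tag appears, but it needs no ruby and no plugin.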

This comes to the nub of my problem with software. I want it to serve me, and mostly to get out of my way, but by knowing just enough to have a little bit of control of my environment I often get seduced by the desire to have perfect control. I just need to step back a little, and ask: is what I want to do here worth the potential time and effort that it might take me to complete, compared with finding a solution that is good enough and meets most of my needs?