Sits meeting notes, November 2010.

Wed Mar 30, 2011

2060 Words

Tags: sits, jisc, DLF2010

2010-11-04 sits

Right, the meeting is starting, it’s an open meeting, so we don’t have a specific agenda, but as the day goes on topics for discussion will emerge, and these will be time-boxed for discussion.

As the participation list is available on the eventbrite site, I’m not going to capture that here.

Some topics from the last meeting

signatures on digital objects
correct identifiers on digital objects
author identifiers
lightweight languages
toolsets
sword, sword2

Some potential topics for this meeting

authentication and access control
rdf and linked data within institutions and repositories
data curation
reasonable workflow components, best practices
lightweight tool sharing
web archiving
sustainable storage
tool sharing
data curation, big data, and what the concept of curation means for repositories

Actual topics discussed

Authentication
Linked data
Web Archiving
Identifiers for people
Microservices
SEO and search engine optimisation
Lightweight Languages

topic discussions

Authentication

The first option that is discussed is shibboleth. Generally authentication breaks down according on one of the following mechanisms:

shib
non-shib
ip based

One of the questions is how does service host have trust in 3rd party services, and understand whether that other provider is doing a good job?

A problem with ldap is that groups within ldap could go out of date.

:question: has shibboleth provided support for services yet? :answer: it’s an interesting question

Are there approaches on service to service authentication that have worked 3 concrete examples:

j2k service, it retrieves content over http
integrating with JOVE, pass in to it a url source, and if that content is protected it will be the same issue
fedora

So there are social, and technological problems

How does one manage individual level access on a lightweight basis, for instance, allowing people to pass something off to an iPhone. How do you enable content owners to set the access permissions of their own data.

If you deliver all of your content with Drupal, you can just hook into drupal’s authentication systems, and this makes it look to the end user that they are just managing a part of their site.

So there are a couple of things going on here:

institute to institute
institute to end users around content held by the institute
data owner within institute to 3rd party applications for that user, e.g. iPhone apps
data owner within institute setting permissions around their own data

An interesting use case is how you allow anonymous local users on a campus wide system to access content. Shib2 has a way of allowing an ip address to skip the authentication step, it’s a pretty static list. This use case was designed for kiosk machines

There are ezproxy campuses and non-ezproxy campuses. This use case is slightly different from the normal experience of a web user because when they go to a 3rd party service we need to figure out a way that would enable them to assert the university rights at that end point.

takeaways from this talking point:

think about pushing the authentication down into the stack, so that the front layer can be plugged into a number of different areas? (I didn’t capture that point well)
articulate how openauth does or does not work within institutions (what does saml give you that oauth does not give you?)
are there any examples of universities that are using openid?
think about creating a microservices driven layer for multi-authentication systems

Linked data

What is happening and what is not happening, what could easily happen, and why is it not happening?

In blacklight search repositories there is a lot of content aggregating there, and these could be exposed as rdf. I ask what is the advantage or disadvantage of rdf over opensearch. This is answered quite nicely, linked data is about known content, and opensearch is about finding content when you don’t know exactly where it might be. Think of linked data a bit like rss.

Another question, is there other content that we should be thinking about pushing out too, is there local content, what would that look like, and how would that help in the open world?

Is there a spec for exposing triples as linked data. There are hundreds of specs, and many different ontologies. The best thing to do might be to wait and see what the big players are doing. A good example is Google’s recent adoption of GoodRelations.

One thing that really bit Southampton is to think about how it is going to be consumed, and ideally consume it yourself. A good analogy is to create a user interface, and never use that user interface yourself. It’s the only way where one will find the minor but really annoying errors.

Something to look for are javascript tools and restful apis for playing with the semantic web. Jenny Tennison is working on some of these tools like redfquery.

Another issue is what namespace do you link to, and where is the trust. (I’m in sameas, yay! [http://sameas.org/html?q=ian+mulvany&x=0&y=0][me])

[me]: http://sameas.org/html?q=ian+mulvany&x=0&y=0)

How do you decide what you want to link to?

some ontologies that might be useful

The BIBO ontology seems to be a good first place to look, and here are a bunch:

Action Items

have noted that we need to identify promising formats and vocabularies
a low hanging fruit is to have a community effort to identify, even at the predicate level, what terms to use (there is a lot of discussion going on in the UK between various groups, but there is no formal …
we will set up a freebase page for collecting ideas around this

Web Archiving

From the maemento perspective, an archive is no different from the rest of the web, it just has content from the rest of the web. That includes CMS’s, these are just archives systems, as they have the content from the past sitting in their databases.

Issues:

versioning requirements
accessibility

As a starting point it would be interesting to see what the Duraspace people are thinking.

It seems like making a repository a time gate for itself is a great idea. Dspace, however, does not do versioning. As a part of the re-architecting there is a hope to get versioning into Dspace via the fedora component, but it is a little way off.

Hydra would be dependant on the fedora component, but it is not implemented at a high level yet.

What % of revisions in repository systems at the moment are ones that could be exposed, compared to ones that are just typos?

This is a decision that one has to make in general, there may be legal restrictions, for instance. But if you do expose them, then doing so in a way that is browsable is probably a good idea.

An interesting question, if something like google becomes maemento aware then it’s likely that this will take care of this issue.

There seems a lot of years between now, and when google is going to provide time aware content.

There is a difference between doing a historic search, and currently available history.

An interesting thing might be to be able to get a list of recent uris from an archive, and find a way to preserve the content at those uris.

There is a “wayn” protocol for handing over web archives from one place to another.

A question comes up for how do you archive rapidly moving web sites. There is no good approach for this yet. Transactional archiving is an approach that might be investigated.

Action Items

if you want to be well behaved, it can be hard to archive social media sites (one could look at consuming feeds)

# Lunch

Now we have a discussion about what we will talk about in the evening session.

Identifiers for people

In terms of author identifiers, Southampton have gone for a local approach, and are creating linked data identifiers. Where two people are the same, use sameas to link them. They do make them available.

The ORCID project is going to go and push ORCIDs onto people that publishers know about. This won’t happen immediately, but will get better as time goes on. Disambiguating authors is costly, so there is a large advantage from doing this as early in the publication process as you can.

ORCID will also act as a registry, so one can lookup things like a REPEC and other identifiers that an author might have.

The big question is whether ORCID should just be an identifier, or whether it should contain information like bibliographies. The issue is if it is an identifier service only, how will it get paid for.

Right now thee is a testbed, the code from Thompson has been submitted, does not cover every use case yet.

There is a hope that this will be available in 2011 at some level. There is a big stakeholders group meeting in London on November 18th.

There will be APIs for deposit and search, as well as a web protocol

Independent researchers will be able to register themselves.

There will be provenance for who made different assertions, whether, for example, a publisher or an institution the independent researcher.

Some questions: has there been any discussion around assuming that this needs to be a central vs distributed blob of information?

isni is another initiative, it’s an iso standard. A number of music. isni’s require that you have published something that people consider to be authoritative. The names project seems to be working with the isni project.

There is work ongoing to test ORCIDs against Dspace.

Resources

Microservices

What do we mean by micro-services? It’s becoming a buzz word because it is identifying an emerging pattern.

There seems to be some confusion over microservices as just being the CDL implementation in contrast to a general architectural approach.

I’m going to skip out and talk to my wife now ;) OK, I’m back from the conversation with my wife.

Resources

CDLIB

We are now onto the final session.

SEO and search engine optimisation

Dspace is working with google scholar, and Anurag at google. Essentially there are a list of tags that google scholar wants, and if you fill those out then google scholar will give you more coverage.

Scholar has a citation meta-tag schema, and so the exercise is to map dublin core to that schema. Many of the local repositories (what they really want are the pdfs, and have metadata around those pdfs). What google scholar want is totally different from what Google news wants, and what the main google search engine wants.

They already had google site maps, but google scholar don’t use those.

You ask scholar to crawl your site. You email Anurag.

An interesting question is how much traffic comes in from google, there is data that about 90% of the repo traffic comes in from google scholar, and more and more of the faculty are now using google scholar over web of science, so this is really really important.

What’s interesting is that google scholar are trying to encourage people to use the de-facto Google Scholar standard

An interesting use of testing is to use cucumber and a subject specific expert to write tests like:

“when I search for blah” “I should get back search result blah”

There is an interesting conversation about how would this community propose as a better way to organise metadata on the web, and potentially look to propose something in RDF.

Resources

Actions

Perhaps take this issue to NISO, to determine whether there is space for publishers and repositories to push back against google for adoption of better metadata tags. An example would be to propose possibly RDFa (perhaps under the bibo ontology)

Lightweight Languages

A discussion ensues about adopting new tools, and how to convince IT managers to adopt new tools.

security holes will almost always come from customisations
setting expectations, and having a regular patch cycle
ensure that there is a good shared install base with the same dependancy trees
can you turn something old off
log inbound requests for software stacks
bundle your stuff as a war file and make it pretend that it’s a java program

We have not talked about developer retraining. Why should people learn a new language?

Resources

Squid-cache

This work is licensed under a Creative Commons Attribution 4.0 International License