Humbly Report: Sean Bechhofer

Semantics 'n' stuff

Archive for the ‘rdf’ Category

Voles on the Line

with 4 comments

Rodent at Heaton Chapel (artist’s impression)

A friend of mine, Di Maynard, who works in computational linguistics and NLP, alerted me to cheapbotsdonequick last week, a service that makes it really easy to set up a twitter-bot. It hooks up to a twitter account and will tweet generated messages at regular intervals. The message content is generated via a system called tracery, using a grammar to specify rules for string generation. There are a number of bots around that use this service including some that generate SVG images — @softlandscapes is my favourite. I thought this looked like an interesting and fun idea to explore.

I’d done some earlier raspberry pi-based experiments hooking up to real-time rail information, so I decided to stick with the train theme and develop a bot tweeting “status updates” for Northern Rail. These wouldn’t quite be real updates though.

A tracery grammar contains simple rules that are expanded to produce a final result. Each rule can have a number of different alternatives, which are chosen at random. See the tracery tutorial for more information. For my grammar, I produced a number of templates for simple issues, e.g.

high volumes of X reported at Y

plus some consequences such as re-routing or disruption to catering services. The grammar allows us to put together templates plus rules about capitalisation or plurals etc.
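The expansion process can be sketched in a few lines of Python. This is an illustrative re-implementation of the idea, not the tracery library itself, and the symbols and alternatives are toy versions of the ones used in the real grammar:

```python
import random

# Toy tracery-style grammar: each symbol maps to a list of alternatives,
# and #symbol# references inside an alternative are expanded recursively.
grammar = {
    "origin": ["High volumes of #hazard# reported at #station#."],
    "hazard": ["Orkney voles", "Cretan frogs", "Oriental cockroaches"],
    "station": ["Heaton Chapel", "Wressle", "Lostock Gralam"],
}

def expand(symbol, grammar):
    """Pick a random alternative and expand any #name# references in it."""
    rule = random.choice(grammar[symbol])
    while "#" in rule:
        start = rule.index("#")
        end = rule.index("#", start + 1)
        name = rule[start + 1:end]
        rule = rule[:start] + expand(name, grammar) + rule[end + 1:]
    return rule

print(expand("origin", grammar))
```

Each call picks alternatives at random, so repeated calls give the variety that keeps the bot's feed interesting.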

For the terminals of the grammar — the things that appear as X or Y, I pulled lists from an external, third party data source: dbpedia. For those who aren’t aware of dbpedia, it’s a translation of (some of) the data in Wikipedia into a nicely structured form (RDF), which is then made available via a query endpoint. In this case, I used dbpedia’s SPARQL endpoint to query for words to use as terminals in the grammar. There are other open data sources I could have used, but this was one I was familiar with.
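As a sketch, the kind of query involved might look like the following. The dbo:operator predicate is an assumption about DBpedia's modelling made for illustration, not the exact query used:

```python
# Build a SPARQL query for English labels of stations run by a given
# operator. The dbo:operator property path is a guess at DBpedia's
# modelling; the real query may use different predicates.
def build_station_query(operator_uri):
    return """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name WHERE {
  ?station dbo:operator <%s> ;
           rdfs:label ?name .
  FILTER (lang(?name) = "en")
}
""" % operator_uri

print(build_station_query("http://dbpedia.org/resource/Northern_Rail"))
```

The result bindings then drop straight into the grammar as terminal alternatives.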

This allowed me to get hold of the stations managed by Northern Rail, plus some “causes” of disruption, which I chose to be European Rodents, Amphibians, common household pests and weather hazards. The final grammar was produced programmatically (using python).

The grammar then produces a series of reports, for example:

Wressle closed due to Oriental cockroaches. Replacement bus service from Lostock Gralam.

The bot is currently set up to tweet at regular intervals, and to date has picked up six followers — five of whom aren’t me! You can find it at @chromaticwhale. Code is available on github.

So, is there anything to this other than some amusement value? Well, not really, but there are perhaps a couple of points of interest. First off, it’s an illustration of the way in which we can make use of third party, open information sources. This is nice because:

  • I don’t need to think about lists of European rodents and amphibians or stations served by Northern Rail.
  • The actual content of the lists was unknown to me, so the combinations thrown up are unexpected and keep me amused.
  • I can substitute in a different collection of stations or hazards and extend when I get bored of hearing about Cretan frogs and Orkney voles.
  • The data sources use standardised vocabulary for the metadata (names etc.) so it’s easy to pull out names of things (potentially in other languages).

I teach an Undergraduate unit on Fundamentals of Computation that focuses largely on defining languages through the use of automata, regular expressions and grammars. The grammars here are (more or less) context free grammars, so this gives an amusing example of what we can do with such a construct.

I am now awaiting the first irate email from a traveller who “didn’t go for the train because you said the station was closed due to an infestation of Orkney Voles”.


Written by Sean Bechhofer

November 16, 2016 at 4:24 pm

Posted in rdf

All the World’s a Stage

with 2 comments

Jason Groth Wigs Out

Anyone who knows me is probably aware of the fact that I’m a keen amateur* musician. So I was very pleased to be able to work on a musical dataset while spending some sabbatical time at OeRC with Dave De Roure. The project has been focused on the Internet Archive‘s Live Music Archive. The Internet Archive is a “non-profit organisation building a library of internet sites and other cultural artifacts in digital form”. They’re the folks responsible for the Wayback Machine, the service that lets you see historical states of web sites.

The Live Music Archive is a community contributed collection of live recordings with over 100,000 performances by nearly 4,000 artists. These aren’t just crappy bootlegs by someone with a tapedeck and a mic down their sleeve either — many are taken from direct feeds off the desk or have been recorded with state of the art equipment. It’s all legal too, as the material in the collection has been sanctioned by the artists. I first came across the archive several years ago — it contains recordings by a number of my current favourites including Mogwai, Calexico and Andrew Bird.

Our task was to take the collection metadata and republish it as Linked Data. This involves a couple of stages. The first is to simply massage the data into an RDF-based form. The second is to provide links to existing resources in other data sources. There are two “obvious” sources to target here: MusicBrainz, which provides information about music artists, and GeoNames, which provides information about geographical locations. Using some simple techniques, we’ve identified mappings between the entities in our collection and external resources, placing the dataset firmly into the Linked Data Cloud. The exercise also raised some interesting questions about how we expose the fact that there is an underlying dataset (the source data from the archive) along with some additional interpretations of that data (the mappings to other sources). There are certainly going to be glitches in the alignment process — with a corpus of this size, automated alignment is the only viable solution — so it’s important that data consumers are aware of what they’re getting. This also relates to other strands of work about preserving scientific processes and new models of publication that we’re pursuing in projects like wf4ever. I’ll try and return to some of these questions in a later post.
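To give a flavour of what a “simple technique” for alignment might look like (an illustrative sketch, not the code actually used), normalising names before comparison catches many of the easy matches:

```python
import re
import unicodedata

def normalise(name):
    """Reduce a name to lowercase ASCII with single spaces."""
    name = unicodedata.normalize("NFKD", name)
    name = name.encode("ascii", "ignore").decode("ascii").lower()
    name = re.sub(r"[^a-z0-9 ]", "", name)
    return re.sub(r"\s+", " ", name).strip()

def align(archive_names, candidate_names):
    """Map archive artist names to candidates with the same normal form."""
    index = {normalise(c): c for c in candidate_names}
    return {a: index[normalise(a)] for a in archive_names
            if normalise(a) in index}
```

Exact matching on normal forms keeps precision high; fuzzier techniques (edit distance, token overlap) raise recall at the cost of more of the glitches mentioned above.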

So what? Why is this interesting? For a start, it’s a fun corpus to play with, and one shouldn’t underestimate the importance of having fun at work! On a more serious note, the corpus provides a useful resource for computational musicology as exemplified by activities such as MIREX. Not only is there metadata about a large number of live performances with links to related resources, but there are links to the underlying audio files from those performances, often in high quality audio formats. So there is an opportunity here to combine analysis of both the metadata and audio. Thus we can potentially compare live performances by individual artists across different geographical locations. This could be in terms of metadata — which artists have played in which locations (see the network below) and does artist X play the same setlist every night? Such a query could also potentially be answered by similar resources. The presence of the audio, however, also offers the possibility of combining metadata queries with computational analysis of the performance audio data — does artist X play the same songs at the same tempo every night, and does that change with geographical location? Of course this corpus is made up of a particular collection of events, so we must be circumspect in deriving any kind of general conclusions about live performances or artist behaviour.

Who Played Where?

The dataset is now available: there is a SPARQL endpoint along with browsable pages delivering HTML/RDF representations via content negotiation. Let us know if you find the data useful, interesting, or if you have any ideas for improvement. There is also a short paper [1] describing the dataset submitted to the Semantic Web Journal. The SWJ has an open review process, so feel free to comment!


  1. Sean Bechhofer, David De Roure and Kevin Page. Hello Cleveland! Linked Data Publication of Live Music Archives. Submitted to the Semantic Web Journal Special Call for Linked Dataset Descriptions.

*Amateur in a positive way in that I do it for the love of it and it’s not how I pay the bills.

Written by Sean Bechhofer

May 23, 2012 at 1:23 pm

Posted in linked data, music, rdf

The Eurovision Workflow Contest

with one comment

Ever wondered where the workflows that are most downloaded or viewed in myExperiment come from? Wonder no longer! Here’s a nifty visualisation using Google’s Public Data Explorer:

myExperiment Statistics

What’s it doing?

myExperiment is a Virtual Research Environment that supports users in the sharing of digital items associated with their research. Initially targeted at scientific workflows, it’s now being used to share a number of different contribution types. For this particular example, however, I focused on workflows. Workflows are associated with an owner, and owners may also provide information about themselves, for example which country they’re in. The site also keeps statistics about views and downloads of the items. This dataset allows exploration of the relationship between these various factors.

How does it work?

The myExperiment data is made available as a SPARQL endpoint, supporting the construction of client applications that can consume the metadata provided by myExperiment. A few simple SPARQL queries (thanks to David Newman for SPARQLing support) allowed me to grab summary information about the numbers of workflows in the repository, their formats, and which country the users came from. The myExperiment endpoint will deliver this information as CSV files, so it’s just then a case of packaging these results up with some metadata and then uploading to the Public Data Explorer. Hey presto, pretty pictures!

The Explorer expects time series data in order to do its visualisations. The data I’m displaying is “snapshot” data, so there’s only one timepoint in the time series — 2011. We can still get some useful visualisations out though, allowing us to explore the relationships between country of origin, formats, and numbers of downloads and views.
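The reshaping step is trivial. A sketch (the column names here are assumptions, not the actual myExperiment schema) might look like:

```python
import csv
import io

def to_time_series(sparql_csv, year=2011):
    """Add a constant year column so snapshot data fits the
    time-series shape the Public Data Explorer expects."""
    rows = csv.DictReader(io.StringIO(sparql_csv))
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["country", "year", "workflows"])
    for row in rows:
        writer.writerow([row["country"], year, row["workflows"]])
    return out.getvalue()
```

The single 2011 timepoint means the Explorer's animations are static, but the cross-sectional views still work.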

This is, to quote Peter Snow on Election Night, “just a bit of fun” — I’ve made little attempt to clean the data, and there will be users who have not supplied a country, so the data is not complete and shouldn’t necessarily be taken as completely representative of productivity of the countries involved! In addition, the data doesn’t use country identifiers that the data visualiser knows about, and has no lat/long information, so mapping isn’t available (there are also some interesting potential issues with the use of England and United Kingdom). However, it’s a nice example of plumbing existing pieces of infrastructure together in a lightweight way. Although this was produced in a batch mode, in principle this should be easy to do dynamically.

So, Terry, the results please…

“Royaume Uni, douze points”.

Boom Bang-a-Bang!

Written by Sean Bechhofer

March 16, 2011 at 5:30 pm

Posted in rdf, visualisation

Whale Shark 2.0

with one comment

The fishDelish project is a JISC funded collaboration between the University of Manchester, Hedtek Ltd and the FishBase Information and Research Group Inc. FishBase is “a global information system with all you ever wanted to know about fishes”. FishBase is available as a relational database, and the project is about taking that data and republishing it as RDF/Linked Data. The project is nearing its end, and we now have the FishBase data in a triple store. I took a look at how we could generate some nice looking species pages. FishBase currently offers pages presenting information about species (for example the Whale Shark).

Whale Shark on FishBase

I wanted to try and replicate (some of) this presentation in as simple/lightweight a way as possible. The solution I adopted involves a single SPARQL query that pulls out relevant information about a species, and an XSL stylesheet that transforms the results of that query into an HTML page. The whole thing is tied together with a simple bit of PHP code that executes the SPARQL query (using RAP — a bit long in the tooth, but it does this job), requesting the results as XML. It then uses PHP’s DOMDocument to add a link to the XSL stylesheet into the results. The HTML rendering is then actually handled by the web browser applying the style sheet. The resulting species pages (e.g. the Whale Shark again) are not — to use the words of David Flanders, our JISC Programme Manager — as information rich as the original FishBase pages, but they are sexier.
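The stylesheet-injection trick translates readily to other stacks. Here is a Python equivalent of the PHP DOMDocument step (the stylesheet filename is illustrative): insert an xml-stylesheet processing instruction before the root element, and the browser does the rendering.

```python
from xml.dom import minidom

def attach_stylesheet(sparql_results_xml, xsl_href="species.xsl"):
    """Insert an xml-stylesheet processing instruction so the
    browser applies the XSL transform client-side."""
    doc = minidom.parseString(sparql_results_xml)
    pi = doc.createProcessingInstruction(
        "xml-stylesheet", 'type="text/xsl" href="%s"' % xsl_href)
    doc.insertBefore(pi, doc.documentElement)
    return doc.toxml()
```

Pushing the transform to the client keeps the server-side code to a thin query-and-annotate wrapper.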

Whale Shark on fishDelish

To a certain extent, that’s simply down to styling (I’m a big fan of Georgia), but the exercise did help to explore the usage of SPARQL and XSL on the FishBase dataset. The SPARQL queries and stylesheets developed will also be useful in conjunction with the mysparql libraries developed in fishDelish. mySparql is a service developed in fishDelish that allows the embedding of SPARQL queries into pages.

The first problem I faced was trying to understand the structure of the data in the triplestore. The property names produced by D2R are not always entirely, ermm, readable. As my colleague Bijan Parsia discussed in a blog post describing his fishing expeditions, the state of linked data “browsers” is mixed. I ended up using Chris Gutteridge’s Graphite “Quick and Dirty” RDF Browser to help navigate around the data set.

A second question was how to approach the queries. The species pages have a simple structure. They have a single “topic” (i.e. the species), and then display characteristics of that species. So constructing a species page can be seen as a form filling process where the attributes are predetermined. It’s possible to write a SPARQL query to get information about a species with a single row in the results. The stylesheet (e.g. for species) can grab the values out of those results and “fill in the blanks” as required. An alternative would be to use some kind of generic s-p-o pattern in the query and pull out all the information about a particular URI (i.e. the species). In the species case though, we already know what information we’re interested in getting out so the “canned” approach is fine.
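The two styles can be contrasted directly. The predicate names below are hypothetical placeholders, not the actual (D2R-generated) FishBase property names:

```python
# "Canned" query: we know exactly which attributes the page needs,
# so one row comes back with one column per blank to fill in.
canned = """
SELECT ?name ?family ?length WHERE {
  <%(species)s> fb:commonName ?name ;
                fb:family ?family ;
                fb:maxLength ?length .
}
"""

# Generic s-p-o query: everything known about the species comes back,
# and the client must decide what to display.
generic = "SELECT ?p ?o WHERE { <%(species)s> ?p ?o }"

species = "http://example.org/species/whale-shark"  # placeholder URI
print(canned % {"species": species})
print(generic % {"species": species})
```

The canned form makes the stylesheet trivial; the generic form makes the stylesheet do the thinking.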

I also produced some pages for Orders and Families (e.g. Rhincodontidae or Rajiformes). The SPARQL query here returns a number of rows, as the query asks for all the families in an order or species in a family. There is redundancy in the query result as the first few columns in each row are identical. A cleaner solution here might be to use more than one SPARQL query — one pulling out the family information, one requesting the family/species list. That would require more sophisticated processing though, rather than my lightweight SPARQL query + XSL approach. Again, this is something that the mysparql service would help with.

Overall, this was an interesting experiment and exercise in understanding the FishBase RDF data. Harking back to an earlier blog post from Bijan: as I’m already familiar with SPARQL and XSL, it was probably easier for me to produce these pages using the converted data, but it’s not clear whether that would be true in general. There’s actually very little in here that’s about Linked Data. This could have been done (as the current FishBase pages are done) using the existing relational db plus some queries and a bit of scripting. There was some benefit here in using the standardised protocols and infrastructure offered by SPARQL, RDF, XML and XSL though. It was also very easy for me to do all of this on the client side — all I needed was access to the SPARQL endpoint and some XML tooling. So the real benefit for this particular application is gained from the data publication.

It did help to illustrate the kinds of things we can now begin to do with the RDF data though, and puts us in a situation where we can look at further integration of the data with other data sets. For example it would be nice to hook into resources like the BBC Wildlife Finder pages, which are also packed with semantic goodness.

It was also fun, which is always a good thing! If only the Whale Sharks themselves were as easy to find…..

(This is an edited version of an fishDelish project blog post)

Written by Sean Bechhofer

March 10, 2011 at 10:48 am

Posted in linked data, rdf

SKOS: Organisation

leave a comment »

Roman Aqueduct, Segovia

Continuing the theme of reflecting on SKOS, the question of organisation is next. SKOS provides an RDF vocabulary for describing Knowledge Organisation Systems, and there’s an assumption that SKOS is RDF from the ground up. The use of RDF brings advantages, but there are also limitations, in particular when we consider issues of containment. This is something that I wrestled with in the past when building the OWL-API libraries to support OWL [1]. In the RDF/XML serialisations of OWL, there was no explicit connection between the axioms stated in an ontology and the Ontology object itself. This could cause difficulties in the face of owl:imports, as there was also no explicit link between the location from which an RDF graph representing an ontology is retrieved and the URI of the Ontology itself. This was partly solved by the use of physical and logical URIs, but the question of containment is still there.

There is a similar, but perhaps more easily stated issue with SKOS. Consider, for example, the following fragment from the IVOAT thesaurus [2]:

rotating body
    NT asteroid

Thus asteroid is a narrower term of rotating body. In the SKOS version of this thesaurus, we have two concepts, with triples asserting the appropriate labels, the fact that these concepts occur in the IVOAT scheme, and the narrower relationship.

<> rdf:type skos:ConceptScheme .

:asteroid rdf:type skos:Concept ;
          skos:inScheme <> ;
          skos:prefLabel "asteroid"@en .

:rotatingBody rdf:type skos:Concept ;
              skos:inScheme <> ;
              skos:prefLabel "rotating body"@en ;
              skos:narrower :asteroid .

What we don’t have here, however, is the assertion that the narrower relationship occurs within the ConceptScheme. The same also holds for the labels — the labelling of the concept is not explicitly bound to the concept scheme.

Now, this isn’t really a failing of SKOS, but is rather a consequence of the use of RDF for the representation. Solutions to this could involve reification (bleuurgh) or the use of named graphs to identify the triples associated with a ConceptScheme. At the time of the SKOS Recommendation, however, no standard was available.

Does this really matter? Is it an issue? So far, a lot of SKOS publication seems to be organisations exposing their own vocabularies, with instances of skos:Concept appearing in a single skos:ConceptScheme with semantic relationships asserted “within” that scheme and thus under the control of the Scheme “owner”. That may not be too difficult to manage. Things will get more interesting once we have greater use of the SKOS mapping relationships [3], which are intended for use between Concepts in different ConceptSchemes. Such mappings are likely to present different and potentially conflicting points of view or opinions, and we will then require more details of the provenance of the assertions.


  1. OWL API
  2. IVOAT Thesaurus
  3. SKOS Mapping Properties

Written by Sean Bechhofer

June 16, 2010 at 5:19 pm

Posted in rdf, skos, talks