Archive for the ‘rdf’ Category
A friend of mine, Di Maynard, who works in computational linguistics and NLP, alerted me last week to cheapbotsdonequick, a service that makes it really easy to set up a Twitter bot. It hooks up to a Twitter account and tweets generated messages at regular intervals. The message content is generated via a system called tracery, which uses a grammar to specify rules for string generation. There are a number of bots around that use this service, including some that generate SVG images; @softlandscapes is my favourite. I thought this looked like an interesting and fun idea to explore.
I’d done some earlier Raspberry Pi-based experiments hooking up to real-time rail information, so I decided to stick with the train theme and develop a bot tweeting “status updates” for Northern Rail. These wouldn’t quite be real updates, though.
A tracery grammar contains simple rules that are expanded to produce a final result. Each rule can have a number of different alternatives, which are chosen at random. See the tracery tutorial for more information. For my grammar, I produced a number of templates for simple issues, e.g.
high volumes of X reported at Y
plus some consequences, such as re-routing or disruption to catering services. The grammar allows us to put together templates plus rules about capitalisation, plurals, etc.
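To give a flavour of how this works, here is a minimal sketch of a tracery-style grammar and expander in Python. The rule names and word lists are made up for illustration and aren’t the bot’s actual grammar (real tracery grammars are JSON, and cheapbotsdonequick runs them for you):

```python
import random
import re

# A toy grammar in the style of tracery: each rule maps to a list of
# alternatives; #name# marks a reference to another rule. These rules and
# word lists are illustrative, not the real bot's grammar.
grammar = {
    "origin": [
        "High volumes of #hazard# reported at #station#.",
        "#station# closed due to #hazard#. Replacement bus service from #station#.",
        "Delays at #station#: #hazard# on the line. Catering services suspended.",
    ],
    "station": ["Wressle", "Lostock Gralam", "Bootle Oriel Road"],
    "hazard": ["Orkney voles", "Oriental cockroaches", "freezing rain"],
}

def expand(symbol, rules):
    """Expand a rule by choosing a random alternative, then recursively
    expanding any #references# it contains."""
    text = random.choice(rules[symbol])
    return re.sub(r"#(\w+)#", lambda m: expand(m.group(1), rules), text)

print(expand("origin", grammar))
```

Each #name# reference is replaced by a randomly chosen alternative for that rule, which is exactly the repeated rule application you see in a context-free derivation.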
For the terminals of the grammar (the things that appear as X or Y), I pulled lists from an external, third-party data source: dbpedia. For those who aren’t aware of dbpedia, it’s a translation of (some of) the data in Wikipedia into a nicely structured form (RDF), which is then made available via a query endpoint. In this case, I used dbpedia’s SPARQL endpoint to query for words to use as terminals in the grammar. There are other open data sources I could have used, but this was one I was familiar with.
This allowed me to get hold of the stations managed by Northern Rail, plus some “causes” of disruption, which I chose to be European rodents, amphibians, common household pests and weather hazards. The final grammar was produced programmatically (using Python).
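As a rough sketch of the sort of query involved, the following uses the SPARQLWrapper library against dbpedia’s public endpoint. The property dbo:operator and the resource dbr:Northern_Rail are my guesses at dbpedia’s modelling, not necessarily the queries actually used here:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# dbpedia's public SPARQL endpoint.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")

# dbo:operator / dbr:Northern_Rail are assumptions about how dbpedia
# models station operators; the actual predicates may differ.
sparql.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX dbr:  <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?label WHERE {
        ?station dbo:operator dbr:Northern_Rail ;
                 rdfs:label   ?label .
        FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
stations = [b["label"]["value"] for b in results["results"]["bindings"]]
```

The hazard lists can be pulled in much the same way, for example by querying for members of the relevant Wikipedia categories.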
The grammar then produces a series of reports, for example:
Wressle closed due to Oriental cockroaches. Replacement bus service from Lostock Gralam.
So, is there anything to this other than some amusement value? Well, not really, but there are perhaps a couple of points of interest. First off, it’s an illustration of the way in which we can make use of third-party, open information sources. This is nice because:
- I don’t need to think about lists of European rodents and amphibians or stations served by Northern Rail.
- The actual content of the lists was unknown to me, so the combinations thrown up are unexpected and keep me amused.
- I can substitute in a different collection of stations or hazards, and extend the grammar when I get bored of hearing about Cretan frogs and Orkney voles.
- The data sources use standardised vocabulary for the metadata (names etc.) so it’s easy to pull out names of things (potentially in other languages).
I teach an undergraduate unit on Fundamentals of Computation that focuses largely on defining languages through the use of automata, regular expressions and grammars. The grammars here are (more or less) context-free grammars, so this gives an amusing example of what we can do with such a construct.
I am now awaiting the first irate email from a traveller who “didn’t go for the train because you said the station was closed due to an infestation of Orkney Voles”.
What’s it doing?
myExperiment is a Virtual Research Environment that supports users in the sharing of digital items associated with their research. Initially targeted at scientific workflows, it’s now being used to share a number of different contribution types. For this particular example, however, I focused on workflows. Workflows are associated with an owner, and owners may also provide information about themselves, for example which country they’re in. The site also keeps statistics about views and downloads of the items. This dataset allows exploration of the relationship between these various factors.
How does it work?
The myExperiment data is made available as a SPARQL endpoint, supporting the construction of client applications that can consume the metadata provided by myExperiment. A few simple SPARQL queries (thanks to David Newman for SPARQLing support) allowed me to grab summary information about the numbers of workflows in the repository, their formats, and which countries the users came from. The myExperiment endpoint will deliver this information as CSV files, so it’s then just a case of packaging these results up with some metadata and uploading them to the Public Data Explorer. Hey presto, pretty pictures!
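As a sketch of the plumbing, assuming the endpoint lives at rdf.myexperiment.org and honours CSV content negotiation (both assumptions on my part, as are the prefix and class name in the query):

```python
import requests

# Placeholder endpoint URL; the actual service address may differ.
ENDPOINT = "http://rdf.myexperiment.org/sparql"

# Placeholder prefix and class name for myExperiment's ontology modules.
query = """
PREFIX mebase: <http://rdf.myexperiment.org/ontologies/base/>
SELECT ?workflow WHERE { ?workflow a mebase:Workflow . }
"""

# Ask for CSV via content negotiation; individual SPARQL services vary
# in which result formats they honour.
resp = requests.get(ENDPOINT,
                    params={"query": query},
                    headers={"Accept": "text/csv"})
resp.raise_for_status()
print(resp.text)  # CSV rows, ready to package up for the Public Data Explorer
```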
The Explorer expects time series data in order to do its visualisations. The data I’m displaying is “snapshot” data, so there’s only one timepoint in the time series — 2011. We can still get some useful visualisations out though, allowing us to explore the relationships between country of origin, formats, and numbers of downloads and views.
This is, to quote Peter Snow on Election Night, “just a bit of fun” — I’ve made little attempt to clean the data, and there will be users who have not supplied a country, so the data is not complete and shouldn’t necessarily be taken as completely representative of productivity of the countries involved! In addition, the data doesn’t use country identifiers that the data visualiser knows about, and has no lat/long information, so mapping isn’t available (there are also some interesting potential issues with the use of England and United Kingdom). However, it’s a nice example of plumbing existing pieces of infrastructure together in a lightweight way. Although this was produced in a batch mode, in principle this should be easy to do dynamically.
So, Terry, the results please…
“Royaume Uni, douze points”.
Continuing the theme of reflecting on SKOS, the question of organisation is next. SKOS provides an RDF vocabulary for describing Knowledge Organisation Systems, and there’s an assumption that SKOS is RDF from the ground up. The use of RDF brings advantages, but there are also limitations, in particular when we consider issues of containment. This is something that I wrestled with in the past when building the OWL-API libraries to support OWL. In the RDF/XML serialisations of OWL, there was no explicit connection between the axioms stated in an ontology and the Ontology object itself. This could cause difficulties in the face of owl:imports, as there was also no explicit link between the location from which an RDF graph representing an ontology is retrieved and the URI of the Ontology itself. This was partly solved by the use of physical and logical URIs, but the question of containment is still there.
There is a similar, but perhaps more easily stated, issue with SKOS. Consider, for example, the following fragment from the IVOAT thesaurus: asteroid is a narrower term of rotating body. In the SKOS version of this thesaurus, we have two concepts, asteroid and rotating body (the former identified as http://www.ivoa.net/rdf/Vocabularies/IVOAT#asteroid), with triples asserting the appropriate labels, the fact that these concepts occur in the IVOAT scheme, and the narrower relationship. What we don’t have here, however, is the assertion that the narrower relationship occurs within the ConceptScheme. The same also holds for the labels: the labelling of the concept is not explicitly bound to the concept scheme.
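To make the gap concrete, here’s a sketch of those triples in Python with rdflib; the scheme URI and the URI for the rotating-body concept are my guesses at the thesaurus’s naming conventions:

```python
from rdflib import Graph, Namespace, URIRef
from rdflib.namespace import RDF, SKOS

IVOAT = Namespace("http://www.ivoa.net/rdf/Vocabularies/IVOAT#")
# Guessed URIs for the scheme and the rotating-body concept.
scheme = URIRef("http://www.ivoa.net/rdf/Vocabularies/IVOAT")
asteroid, rotating = IVOAT.asteroid, IVOAT.rotatingBody

g = Graph()
g.add((scheme, RDF.type, SKOS.ConceptScheme))
for concept in (asteroid, rotating):
    g.add((concept, RDF.type, SKOS.Concept))
    g.add((concept, SKOS.inScheme, scheme))   # concepts are tied to the scheme...
g.add((rotating, SKOS.narrower, asteroid))    # ...but this triple is tied to nothing
```

The concepts each point at the scheme via skos:inScheme, but the skos:narrower triple itself just sits in the graph, attached to nothing.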
Now, this isn’t really a failing of SKOS, but is rather a consequence of the use of RDF for the representation. Solutions to this could involve reification (bleuurgh) or the use of named graphs to identify the triples associated with a ConceptScheme. At the time of the SKOS Recommendation, however, no standard for named graphs was available.
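For what the named-graph route might look like (sketched with rdflib’s Dataset, reusing the guessed URIs from above):

```python
from rdflib import Dataset, Namespace, URIRef
from rdflib.namespace import SKOS

IVOAT = Namespace("http://www.ivoa.net/rdf/Vocabularies/IVOAT#")
scheme = URIRef("http://www.ivoa.net/rdf/Vocabularies/IVOAT")

ds = Dataset()
# All assertions "belonging" to the scheme live in a named graph whose
# identifier is the scheme URI, making containment explicit.
g = ds.graph(scheme)
g.add((IVOAT.rotatingBody, SKOS.narrower, IVOAT.asteroid))

# Consumers can then ask which scheme a relationship was asserted in:
for s, p, o, name in ds.quads((None, SKOS.narrower, None, None)):
    print(f"{s} narrower {o} asserted in {name}")
```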
Does this really matter? Is it an issue? So far, a lot of SKOS publication seems to be organisations exposing their own vocabularies, with instances of skos:Concept appearing in a single skos:ConceptScheme, with semantic relationships asserted “within” that scheme and thus under the control of the Scheme “owner”. That may not be too difficult to manage. Things will get more interesting once we have greater use of the SKOS mapping relationships, which are intended for use between Concepts in different ConceptSchemes. Such mappings are likely to present different and potentially conflicting points of view or opinions, and we will then require more details of the provenance of the assertions.