Humbly Report: Sean Bechhofer

Semantics 'n' stuff

All the World’s a Stage

with 2 comments

Jason Groth Wigs Out

Anyone who knows me is probably aware that I’m a keen amateur* musician. So I was very pleased to be able to work on a musical dataset while spending some sabbatical time at OeRC with Dave De Roure. The project has focused on the Internet Archive’s Live Music Archive. The Internet Archive is a “non-profit organisation building a library of internet sites and other cultural artifacts in digital form”. They’re the folks responsible for the Wayback Machine, the service that lets you see historical states of web sites.

The Live Music Archive is a community-contributed collection of live recordings, with over 100,000 performances by nearly 4,000 artists. These aren’t just crappy bootlegs by someone with a tape deck and a mic down their sleeve either; many are taken from direct feeds off the desk or have been recorded with state-of-the-art equipment. It’s all legal too, as the material in the collection has been sanctioned by the artists. I first came across the archive several years ago; it contains recordings by a number of my current favourites, including Mogwai, Calexico and Andrew Bird.

Our task was to take the collection metadata and republish it as Linked Data. This involves a couple of stages. The first is simply to massage the data into an RDF-based form. The second is to provide links to existing resources in other data sources. There are two “obvious” sources to target here: MusicBrainz, which provides information about music artists, and GeoNames, which provides information about geographical locations. Using some simple techniques, we’ve identified mappings between the entities in our collection and external resources, placing the dataset firmly in the Linked Data Cloud. The exercise also raised some interesting questions about how we expose the fact that there is an underlying dataset (the source data from the archive) along with some additional interpretations of that data (the mappings to other sources). There are certainly going to be glitches in the alignment process (with a corpus of this size, automated alignment is the only viable solution), so it’s important that data consumers are aware of what they’re getting. This also relates to other strands of work on preserving scientific processes and new models of publication that we’re pursuing in projects like Wf4Ever. I’ll try and return to some of these questions in a later post.
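To give a flavour of what those links make possible, here is a minimal sketch of a query that pulls out artists together with their MusicBrainz counterparts. It assumes the mappings are published as plain owl:sameAs links, which is purely an illustrative assumption; the dataset may well record its mappings in a different (and more cautious) way, precisely because of the provenance questions above.

    # Hypothetical sketch: assumes owl:sameAs links to MusicBrainz URIs; the
    # published dataset may record its mappings differently.
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    SELECT ?artist ?mbArtist
    WHERE {
      ?artist owl:sameAs ?mbArtist .
      FILTER(STRSTARTS(STR(?mbArtist), "http://musicbrainz.org/"))
    }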

So what? Why is this interesting? For a start, it’s a fun corpus to play with, and one shouldn’t underestimate the importance of having fun at work! On a more serious note, the corpus provides a useful resource for computational musicology, as exemplified by activities such as MIREX. Not only is there metadata about a large number of live performances with links to related resources, but there are links to the underlying audio files from those performances, often in high-quality audio formats. So there is an opportunity here to combine analysis of both the metadata and the audio. Thus we can potentially compare live performances by individual artists across different geographical locations. This could be in terms of metadata: which artists have played in which locations (see the network below), and does artist X play the same setlist every night? Such a query could also potentially be answered by similar resources such as http://www.setlist.fm. The presence of the audio, however, also offers the possibility of combining metadata queries with computational analysis of the performance audio data: does artist X play the same songs at the same tempo every night, and does that change with geographical location? Of course this corpus is made up of a particular collection of events, so we must be circumspect in drawing any kind of general conclusions about live performances or artist behaviour.
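To make the “who played where” question concrete, here is a rough sketch of the sort of SPARQL query involved. The etree: property names are invented placeholders rather than the vocabulary actually used in the published dataset.

    # Illustrative only: the etree: terms below are made-up stand-ins for
    # whatever the real dataset uses to relate performances, artists and venues.
    PREFIX etree: <http://example.org/etree/>

    SELECT ?artistName ?venueName (COUNT(?performance) AS ?gigs)
    WHERE {
      ?performance a etree:Performance ;
                   etree:artist ?artist ;
                   etree:venue  ?venue .
      ?artist etree:name ?artistName .
      ?venue  etree:name ?venueName .
    }
    GROUP BY ?artistName ?venueName
    ORDER BY DESC(?gigs)

Setlist and tempo questions would then layer further metadata queries, or analysis of the linked audio, on top of results like these.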

Who Played Where?

The dataset is accessible from http://etree.linkedmusic.org. There is a SPARQL endpoint along with browsable pages delivering HTML/RDF representations via content negotiation. Let us know if you find the data useful or interesting, or if you have any ideas for improvement. There is also a short paper [1] describing the dataset, submitted to the Semantic Web Journal. The SWJ has an open review process, so feel free to comment!

REFERENCES

  1. Sean Bechhofer, David De Roure and Kevin Page. Hello Cleveland! Linked Data Publication of Live Music Archives. Submitted to the Semantic Web Journal Special Call for Linked Dataset Descriptions.

*Amateur in a positive way in that I do it for the love of it and it’s not how I pay the bills.

Written by Sean Bechhofer

May 23, 2012 at 1:23 pm

Posted in linked data, music, rdf


What’s the Story, Morning Glory?

with one comment


I went to the sameAs meeting in London this week, where the theme was storytelling. It’s the first time I’ve been to a sameAs meetup (I’m in Oxford at OeRC for the next few weeks and it’s a bit easier to get to London from here than from Manchester) and it was an interesting evening.

In one of the three talks, science writer and blogger Ed Yong told a tale that started 150 million years ago with a mayfly in some mud, and ended up with a scientist wandering around lost in a swamp [1]. The (ultimately successful) search resulted in a publication [2], but one of Yong’s points was that the (potentially interesting) back story about the search leading to the discovery of the fossil wasn’t related in the paper. Should it have been?

To answer that question we’d have to ask “is it important to the science being presented in the paper?”, or perhaps more concretely “will including this make it more likely that the paper will be accepted for publication?”. For the majority of publication outlets, the answer is probably no. But it certainly belongs somewhere; if nothing else, it provides a human side to the work that would help with public engagement or dissemination. Yong suggested that perhaps such information should be included in supplementary material. Many scientists are also now bloggers, so an obvious option is that we tell these additional stories through our blogs.

A question asked after the talk was whether narrative was really crucial to scientific papers. In my opinion (and based on my admittedly narrow experience of writing Computer Science papers) it certainly is — having a clear story to tell is vital if we are to write good, readable scientific papers. That doesn’t necessarily mean to say that we include all of the contextual detail (for example, stumbling lost around a swamp), but we do need a story to guide the reader.

As highlighted in some of the discussion after Yong’s talk, the way in which the story is told in a paper often doesn’t represent the true nature of the investigation. We may have gone down blind alleys, backtracked, repeated or redesigned experiments along the way. So the final presentation in the paper often isn’t a chronologically accurate description of the process. The story can get chopped up and reconstituted with a post hoc presentation of the timeline. That retelling may end up losing key information for those wishing to understand the process the authors went through.

Work that we’re currently pursuing in the Wf4Ever project is addressing (some of) these issues. The project is investigating the use of Research Objects [3] to aggregate and bundle together the resources used in a scientific investigation. In particular, we’re focusing on two domains (genomics and astronomy) that make use of scientific workflows to code up and execute the analyses taking place in an investigation. The hope is that by bundling together the context (in terms of the method/workflow, data sets, parameters, provenance information about data, workflow traces etc.), a researcher has a better chance of understanding what took place and in turn building on those results, supporting reproducible science [4]. Other related work aims to define executable papers (e.g. Elsevier’s Executable Paper Grand Challenge [5]) that allow validation of code and data. The FORCE11 group [6] also see the notion of a Research Object as replacing or superseding traditional paper publication.

Of course, even an enhanced publication still needs a good narrative and a story. Perhaps though our publications of tomorrow will include not just the text and arguments, but also the data, methods, and GPS tracks of a researcher lost in the woods….

REFERENCES

  1. Treasure hunt ends with a stunning fossil of a flying insect. Ed Yong, Not Exactly Rocket Science. http://blogs.discovermagazine.com/notrocketscience/2011/04/04/treasure-hunt-ends-with-a-stunning-fossil-of-a-flying-insect/.
  2. Late Carboniferous paleoichnology reveals the oldest full-body impression of a flying insect. R. J. Knecht et al. PNAS 108(16) pp. 6515–6519. http://dx.doi.org/10.1073/pnas.1015948108
  3. Linked Data is Not Enough for Scientists. S. Bechhofer et al. Future Generation Computer Systems, 2011. http://dx.doi.org/10.1016/j.future.2011.08.004
  4. Accessible Reproducible Research. J. Mesirov. Science 327(5964) pp. 415–416. http://dx.doi.org/10.1126/science.1179653
  5. Executable Papers Grand Challenge. http://www.executablepapers.com/.
  6. Improving Future Research Communication and e-Scholarship. Phil Bourne, Tim Clark et al. FORCE11 Manifesto.

Written by Sean Bechhofer

February 22, 2012 at 5:36 pm

Posted in projects, research objects


Sparky’s Magic Piano

leave a comment »

Not actually the Magic Piano.....

In the week before Christmas, I attended the Digital Music Research Network meeting at Queen Mary, University of London. Digital Music research is not an area I’m currently involved with, but I went to the meeting at the suggestion of Dave De Roure. I’ll be spending some sabbatical time with Dave in Oxford this year, and one of the things we’re going to be looking at is whether we can apply the technologies and approaches being developed in other projects (in particular the Research Objects of Wf4Ever) to tasks like Music Information Retrieval. I’m also excited about this as it fits with some of my extra-curricular interests in music. The mix of the technical and artistic (in terms of both content and people) reminded me of the Hypertext conferences that I went to back in ’99 and ’00.

Although some of the talks were a long way from my expertise, I found a few of particular interest. The opening keynote from Elaine Chew discussed some of the issues involved in conducting research: for example, ensuring that work leads to publication (and publications that “count”), that credit is given to the researchers involved in the work, and that the work is sustainable. This was illustrated with some fascinating video footage of experiments with a piano duo, investigating how the introduction of delay affects the interaction and interplay between performers.

Gyorgy Fazekas presented the Studio Ontology, a model that builds on earlier work on the Music Ontology by Yves Raimond. At first sight, the ontology seems fairly lightweight (largely an asserted taxonomy), but given my own interests in Semantic Web technologies, this is clearly an area for further investigation.

The jewel in the crown, however, was Andrew McPherson’s work on Electronic augmentation of the acoustic grand piano. The magnetic resonator piano uses electromagnets to induce string vibrations. For those of you familiar with the EBow, used by guitarists including Bill Nelson and Robert Fripp, it’s like a piano with 88 EBows bolted on to it. A keyboard sensor (I believe using a Moog Piano Bar) captures data from the keys and drives the system. The whole thing requires no alteration to the instrument, and can be set up in a few hours. It’s an electronic instrument, but all the sound is produced using the physical soundboard and strings of the instrument itself (i.e. no amplifier/speakers).

The overall effect is a little like an organ, with infinite sustain of notes, but many more subtle effects can be obtained including string “bending” and the introduction of additional harmonic tones. Andrew gave a demonstration of the instrument over lunch. One regret I have is that performance anxiety kicked in here (I’m a fairly rudimentary pianist) and I didn’t rush forward to have a go when he offered it to the floor! And I hadn’t brought a camera. Videos on Andrew’s site show the instrument in action.

One aspect here is the use of various gestures. Electronic keyboards have facilities like aftertouch, allowing the player to apply additional pressure to the keys to control additional tones and effects. This is possible here too, with other gestures, such as sliding the fingers along or up and down the keys, also being used to “play” the instrument. In the talk, Andrew described some additional work he was doing on providing enhanced keyboard controllers to support these additional gestures. The piano keyboard is a ubiquitous controller and interface to a musical instrument; it will be interesting to see how these additional gestures and controls fit in with players’ established practices, and which gestures are “right” for which effects.

Of course, the obvious question that we then all asked was what other instruments one could apply this approach to. Answers on a postcard……

Written by Sean Bechhofer

January 6, 2012 at 1:12 pm

Posted in music, workshop


School’s in for Summer

leave a comment »

View from Puerto de Fuenfria

I’ve just spent a week at the Summer School on Ontology Engineering and the Semantic Web (SSSW’11) in Cercedilla, Spain.

Dating back to the early days of the OntoWeb Network, this is the eighth time the school has run, and over the years the content has been finely tuned to provide a balance of practical and theoretical work, with both individual and teamwork elements. The programme includes invited talks, technical lectures, hands-on sessions, and a mini-project that has students working in small groups to develop an idea over the course of the week. A poster session also gives attendees the opportunity to present their work and gain valuable feedback.

Invited speakers this year included Jim Hendler, Steffen Staab, Harith Alani, Peter Mika, Martin Hepp and Oscar Corcho (stepping in at the last minute for an indisposed Manfred Hauswirth). All gave interesting talks, with Martin Hepp’s being perhaps the most lively; starting a talk with the claim that the Semantic Web was facing a collapse similar to that suffered by Constantinople is an interesting opening gambit. And any talk that mentions Horse Muesli is going to get my attention.

With nine tutors present all week, and six invited speakers, most of whom spend a few days in Cercedilla, there are many opportunities for one-to-one and detailed discussions. As ever (this is the sixth Cercedilla school I’ve tutored at), it was a pleasure to be involved and to spend time with fifty very smart, motivated people (it’s quite nice hanging out with the other tutors too). Last but not least, as well as the technical content, the school puts a strong emphasis on a social programme that involves everybody. This makes for an action-packed, but pretty exhausting, week (especially for the over-40s).

The week ends with a series of mini-project presentations on the Saturday morning. Topics this year were as varied as in past years, including ontology mapping, holidays from hell, analysis of Twitter behaviour, e-commerce ontologies, patterns for modularisation, research ratings and a modern take on the I-Spy books from the 50s and 60s. The groups also made short videos, which were shown after Friday night’s dinner. These were mostly light-hearted rather than technical, but they did showcase some additional artistic skills! The winning video included novel use of the iPhone Word Lens app.

There was a Twitter stream running during the meeting, but it was refreshing to see an audience actually following the talks rather than sitting absorbed in laptops! The talks, hands-on sessions and projects had quite a strong Linked Data feel to them, but as @rich_francis tweeted, it was

Great to see two families (linked data and Ontology Engineering) coming together #sssw11

For some reason, travel home from Madrid is never simple, and there was a crazy mad dash through LHR to make a connection, but that’s another story. I just suggest that you never travel with Fabio……

Congratulations to Enrico Motta, Asun Gomez-Perez, Oscar Corcho, Mathieu D’Aquin and all the local UPM team for the management and organisation, and we’ll look forward to SSSW 2012. But now I need to sleep for a week.

Written by Sean Bechhofer

July 18, 2011 at 5:10 pm

Posted in owl, teaching


Future Everything

leave a comment »

Future Everything 2011


I attended some of the ideas sessions at FutureEverything in Manchester last month (thanks to @julianlstar). I particularly enjoyed the session on Linked Data/Linked Stories. Chris Taggart opened with some reflections about data that’s been released and is available on Openly Local. His comment was that a few years ago, the presence of this data would itself have been a story: “Council spends X million”. Now it’s just data, and the stories are about teasing information out of that data. He also highlighted cases where data had been redacted (due, I believe, to there being personal information involved). However, that list contains some big-ticket items, such as £50K for hospitality and trading services. Is there a story here….? With this move to opening up the data, the omission of information can become as important as its inclusion.

Martin Belam and David Higgerson also gave interesting position statements, but it was something that Paul Bradshaw said that stuck with me most. He described the steps in data journalism as being:

  • Compile
  • Clean
  • Connect
  • Communicate

Communication here is in particular about how one visualises the information and provides something that is somehow personalised; we can sometimes lose the notion of the individual when considering masses of numbers. Higgerson also talked about the importance of narrative in presenting a story: plotting and charting data is not enough, and context is key (but then that’s the case for any statistical treatment, I guess). The thing that struck me here was that these are pretty much the same steps that we go through as scientists conducting research. When writing papers, it’s often the story or narrative that’s the hard thing to get right. Maybe this shouldn’t be unexpected (after all, data journalism and scientific research shouldn’t be that far apart), but it’s nice when these connections pop up.

Other sessions that I found interesting were on hacking culture and Sue Thomas talking about Creative Truancy, although I’m not that keen on bringing too much cyberspace into the natural world. One of the reasons I like being on top of a mountain or under the sea is precisely because I’m disconnected from other things, and the view/fish/whatever can command all my attention.

Food for Art

On the arts side, one piece that I particularly enjoyed was The Food of Art: a collection of posters that made me laugh out loud, and that’s always a sign of good art for me! It seems there are quite a lot of calories in a dead deer…..

Written by Sean Bechhofer

June 9, 2011 at 5:11 pm

Posted in conference


The Eurovision Workflow Contest

with one comment

Ever wondered where the workflows that are most downloaded or viewed in myExperiment come from? Wonder no longer! Here’s a nifty visualisation using Google’s Public Data Explorer:

myExperiment Statistics

What’s it doing?

myExperiment is a Virtual Research Environment that supports users in the sharing of digital items associated with their research. Initially targeted at scientific workflows, it’s now being used to share a number of different contribution types. For this particular example, however, I focused on workflows. Workflows are associated with an owner, and owners may also provide information about themselves, for example which country they’re in. The site also keeps statistics about views and downloads of the items. This dataset allows exploration of the relationship between these various factors.

How does it work?

The myExperiment data is made available via a SPARQL endpoint, supporting the construction of client applications that can consume the metadata provided by myExperiment. A few simple SPARQL queries (thanks to David Newman for SPARQLing support) allowed me to grab summary information about the numbers of workflows in the repository, their formats, and which country the users came from. The myExperiment endpoint will deliver this information as CSV files, so it’s then just a case of packaging these results up with some metadata and uploading them to the Public Data Explorer. Hey presto, pretty pictures!
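For a flavour of what those summary queries look like, here is a minimal sketch of a workflows-per-country-and-format count. The me: terms are hypothetical placeholders standing in for the actual myExperiment vocabulary, so treat it as an illustration of shape rather than a copy-and-paste query.

    # Sketch only: the me: prefix and property names are invented stand-ins,
    # not the exact predicates exposed by the myExperiment SPARQL endpoint.
    PREFIX me: <http://example.org/myexperiment/>

    SELECT ?country ?format (COUNT(?workflow) AS ?total)
    WHERE {
      ?workflow a me:Workflow ;
                me:contentType ?format ;
                me:owner ?owner .
      ?owner me:country ?country .
    }
    GROUP BY ?country ?format
    ORDER BY DESC(?total)

Asking the endpoint for results like these in CSV gives more or less the tabular form the Public Data Explorer wants, once the dataset metadata has been wrapped around it.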

The Explorer expects time series data in order to do its visualisations. The data I’m displaying is “snapshot” data, so there’s only one timepoint in the time series — 2011. We can still get some useful visualisations out though, allowing us to explore the relationships between country of origin, formats, and numbers of downloads and views.

This is, to quote Peter Snow on Election Night, “just a bit of fun” — I’ve made little attempt to clean the data, and there will be users who have not supplied a country, so the data is not complete and shouldn’t necessarily be taken as completely representative of productivity of the countries involved! In addition, the data doesn’t use country identifiers that the data visualiser knows about, and has no lat/long information, so mapping isn’t available (there are also some interesting potential issues with the use of England and United Kingdom). However, it’s a nice example of plumbing existing pieces of infrastructure together in a lightweight way. Although this was produced in a batch mode, in principle this should be easy to do dynamically.

So, Terry, the results please…

“Royaume Uni, douze points”.

Boom Bang-a-Bang!

Written by Sean Bechhofer

March 16, 2011 at 5:30 pm

Posted in rdf, visualisation


Whale Shark 2.0

with one comment

The fishDelish project is a JISC-funded collaboration between the University of Manchester, Hedtek Ltd and the FishBase Information and Research Group Inc. FishBase is “a global information system with all you ever wanted to know about fishes.” FishBase is available as a relational database, and the project is about taking that data and republishing it as RDF/Linked Data. The project is nearing its end, and we now have the FishBase data in a triple store. I took a look at how we could generate some nice-looking species pages. FishBase currently offers pages presenting information about species (for example, the Whale Shark).

Whale Shark on FishBase

I wanted to try and replicate (some of) this presentation in as simple and lightweight a way as possible. The solution I adopted involves a single SPARQL query that pulls out relevant information about a species, and an XSL stylesheet that transforms the results of that query into an HTML page. The whole thing is tied together with a simple bit of PHP code that executes the SPARQL query (using RAP, which is a bit long in the tooth, but it does the job), requesting the results as XML. It then uses PHP’s DOMDocument to add a link to the XSL stylesheet into the results. The HTML rendering is then handled by the web browser applying the stylesheet. The resulting species pages (e.g. the Whale Shark again) are not, to use the words of David Flanders, our JISC Programme Manager, as information-rich as the original FishBase pages, but they are sexier.

Whale Shark on fishDelish

To a certain extent, that’s simply down to styling (I’m a big fan of Georgia), but the exercise did help to explore the use of SPARQL and XSL on the FishBase dataset. The SPARQL queries and stylesheets developed will also be useful in conjunction with the mySparql libraries developed in fishDelish; mySparql is a service that allows the embedding of SPARQL queries into pages.

The first problem I faced was trying to understand the structure of the data in the triplestore. The property names produced by D2R are not always entirely, ermm, readable. As my colleague Bijan Parsia discussed in a blog post describing his fishing expeditions, the state of linked data “browsers” is mixed. I ended up using Chris Gutteridge’s Graphite “Quick and Dirty” RDF Browser to help navigate around the data set.

A second question was how to approach the queries. The species pages have a simple structure. They have a single “topic” (i.e. the species), and then display characteristics of that species. So constructing a species page can be seen as a form-filling process where the attributes are predetermined. It’s possible to write a SPARQL query to get information about a species with a single row in the results. The stylesheet (e.g. for species) can grab the values out of those results and “fill in the blanks” as required. An alternative would be to use some kind of generic s-p-o pattern in the query and pull out all the information about a particular URI (i.e. the species). In the species case though, we already know what information we’re interested in getting out, so the “canned” approach is fine.
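As a sketch of what such a “canned” query might look like (the fd: property names and the species URI below are invented for illustration, not the terms D2R actually generates from the FishBase schema):

    # Hypothetical sketch: a one-row "form filling" query for a species page.
    # The fd: terms and species URI are placeholders, not the real D2R output.
    PREFIX fd: <http://example.org/fishdelish/>

    SELECT ?commonName ?family ?maxLength ?distribution
    WHERE {
      <http://example.org/fishdelish/species/Rhincodon-typus>
          fd:commonName   ?commonName ;
          fd:family       ?family ;
          fd:maxLength    ?maxLength ;
          fd:distribution ?distribution .
    }
    LIMIT 1

In practice some attributes will be missing for some species, so the corresponding patterns would probably want to sit inside OPTIONAL blocks to stop the single result row from disappearing.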

I also produced some pages for Orders and Families (e.g. Rhincodontidae or Rajiformes). The SPARQL query here returns a number of rows, as the query asks for all the families in an order or species in a family. There is redundancy in the query result as the first few columns in each row are identical. A cleaner solution here might be to use more than one SPARQL query — one pulling out the family information, one requesting the family/species list. That would require more sophisticated processing though, rather than my lightweight SPARQL query + XSL approach. Again, this is something that the mysparql service would help with.
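For comparison, a sketch of the single-query, multi-row version (with the same invented fd: terms) makes the redundancy obvious: the order-level values are repeated on every row.

    # Hypothetical sketch: one row per family in an order, so the order-level
    # columns (?orderName, ?orderDescription) repeat identically on each row.
    PREFIX fd: <http://example.org/fishdelish/>

    SELECT ?orderName ?orderDescription ?familyName
    WHERE {
      ?order fd:scientificName ?orderName ;
             fd:description    ?orderDescription .
      ?family fd:inOrder        ?order ;
              fd:scientificName ?familyName .
    }
    ORDER BY ?familyName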

Overall, this was an interesting experiment and exercise in understanding the FishBase RDF data. Harking back to an earlier blog post from Bijan, as I’m already familiar with SPARQL and XSL, it was probably easier for me to produce these pages using the converted data, but it’s not clear whether that would be true in general. There’s actually very little in here that’s about Linked Data. This could have been done (as the current FishBase pages are done) using the existing relational database plus some queries and a bit of scripting. There was some benefit here in using the standardised protocols and infrastructure offered by SPARQL, RDF, XML and XSL though. It was also very easy for me to do all of this on the client side: all I needed was access to the SPARQL endpoint and some XML tooling. So the real benefit for this particular application comes from the data publication.

It did help to illustrate the kinds of things we can now begin to do with the RDF data though, and it puts us in a position to look at further integration of the data with other data sets. For example, it would be nice to hook into resources like the BBC Wildlife Finder pages, which are also packed with semantic goodness.

It was also fun, which is always a good thing! If only the Whale Sharks themselves were as easy to find…..

(This is an edited version of a fishDelish project blog post)

Written by Sean Bechhofer

March 10, 2011 at 10:48 am

Posted in linked data, rdf
