A shot at RDF schema discovery

So you wanna know about my RDF data… sure! Here is a SPARQL endpoint URL, now you can go and find everything you want. Dude, it’s RDF! You can query both the data and the schema… all you need is there. Even better, it’s Linked Data and -damn yeah- all the resources have got their cool URIs… now, like a bee jumping happily between flowers, you can navigate through this information. Serendipity and follow-your-nose at their best, isn’t it?

Now that’s true, but it is not as pleasant as it sounds. Understanding the data and the schema well enough to write meaningful queries is a tedious task that involves “tons” of SPARQL queries and imposes a significant mental load… Following HTTP links between Linked Data resources is no less tedious.

One approach to describing RDF datasets and easing navigation, discovery and querying is VoID, a vocabulary for describing RDF datasets… It includes a number of useful properties that help you get an idea of what an RDF dataset contains and how to access it. Example properties include void:vocabulary, void:exampleResource and void:sparqlEndpoint. However, Linked Data datasets usually use terms from different vocabularies and do not follow a particular “schema”. In other words, knowing the vocabularies doesn’t give you a sufficient idea of what is in the dataset (there is no equivalent of the XML world’s XSD or DTD).
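To make those properties a bit more concrete, here is a minimal Python sketch of reading them out of a VoID description. The dataset, the example.org URIs and the void_values helper are all made up for illustration; only the void: property names come from the vocabulary itself.

```python
# VoID namespace; everything under example.org below is a hypothetical dataset.
VOID = "http://rdfs.org/ns/void#"
DATASET = "http://example.org/dataset"

# A VoID description written as plain (subject, predicate, object) tuples.
description = [
    (DATASET, VOID + "sparqlEndpoint", "http://example.org/sparql"),
    (DATASET, VOID + "exampleResource", "http://example.org/artist/1"),
    (DATASET, VOID + "vocabulary", "http://xmlns.com/foaf/0.1/"),
    (DATASET, VOID + "vocabulary", "http://purl.org/ontology/mo/"),
]


def void_values(triples, dataset, prop):
    """Return every object of void:<prop> attached to the given dataset URI."""
    return [o for s, p, o in triples if s == dataset and p == VOID + prop]


endpoint = void_values(description, DATASET, "sparqlEndpoint")
examples = void_values(description, DATASET, "exampleResource")
vocabularies = void_values(description, DATASET, "vocabulary")
```

Note that the vocabulary list only says which namespaces are in use; it says nothing about which properties actually hang off which resources.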

So What?

I built a simple service that generates a summary of an RDF dataset based on its VoID description. The summary is a tree rooted at a void:rootResource or void:exampleResource resource. It is available on Google App Engine. You can start from a VoID description, or by entering a SPARQL endpoint URL and the URI of a resource to start with.
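For a rough idea of what starting from an endpoint and a resource URI involves, the Python sketch below builds a query asking for everything directly attached to a set of resources. This is an assumption about the kind of query such a service could send, not the queries the service actually uses; neighbours_query and its limit parameter are hypothetical names, and the VALUES clause requires SPARQL 1.1.

```python
def neighbours_query(resources, limit=1000):
    """Build a SPARQL query for the outgoing (predicate, object) pairs
    of a set of resource URIs. No escaping is done on the URIs, so this
    is only suitable for trusted input."""
    if not resources:
        raise ValueError("at least one resource URI is required")
    # Wrap each URI in angle brackets, as SPARQL expects for IRIs.
    values = " ".join("<%s>" % uri for uri in resources)
    template = "SELECT DISTINCT ?p ?o WHERE { VALUES ?s { %s } ?s ?p ?o . } LIMIT %d"
    return template % (values, limit)


query = neighbours_query(
    ["http://example.org/artist/1", "http://example.org/artist/2"], limit=10
)
```

The resulting string would then be sent to the endpoint over HTTP; that part is left out here.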

The screenshot below shows an example based on the VoID description of the “BBC backstage Dataset”:
[Screenshot: example of schema from BBC music data]

The tree starts at a resource (or set of resources) and includes all its neighbours… It repeats this for every node up to a specified limit (set through the depth URL parameter). Along the way, it aggregates values whenever a (subject, predicate) pair or a (set of subjects, predicate) pair yields multiple values.

  • Hover over the tree nodes to see the actual values
  • Click on resource nodes (colored violet) to navigate
  • Bordered nodes represent multiple resources/literals
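
The traversal described above can be sketched in a few lines of Python. This is a simplified illustration over an in-memory list of triples, not the service's code: the summarize function, the node layout and the prefixed names in the sample data are all assumptions made for the example.

```python
from collections import defaultdict


def summarize(triples, roots, depth):
    """Build a summary tree. Each node holds the set of values reached
    from its parent through one predicate; children are keyed by predicate."""
    node = {"values": sorted(roots), "children": {}}
    if depth == 0:
        return node
    subjects = set(roots)
    by_predicate = defaultdict(set)
    for s, p, o in triples:
        if s in subjects:
            # Aggregate objects across every subject in the set.
            by_predicate[p].add(o)
    for p in sorted(by_predicate):
        node["children"][p] = summarize(triples, by_predicate[p], depth - 1)
    return node


# Hypothetical sample data using prefixed names for brevity.
triples = [
    ("ex:artist1", "foaf:name", "Alice"),
    ("ex:artist1", "mo:made", "ex:record1"),
    ("ex:artist1", "mo:made", "ex:record2"),
    ("ex:record1", "dc:title", "First"),
    ("ex:record2", "dc:title", "Second"),
]
tree = summarize(triples, {"ex:artist1"}, depth=2)
```

Here the mo:made node holds both records (a "bordered" node in the tree view), and its dc:title child holds both titles. The depth bound also keeps the recursion finite when the data contains cycles.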

A bit of background

This is similar to the work described in Representative Objects: Concise Representations of Semistructured, Hierarchical Data. The paper, published in 1997 by Stanford researchers, formalizes the process of summarizing semistructured data based on automata theory (this can in fact be seen as an NFA-to-DFA transformation).

//TODO list

  1. Handle blank nodes (they are currently ignored)
  2. Share the code on GitHub
  3. How can this be integrated into a SPARQL query editor or Linked Data browser?

* The JavaScript used to plot the tree is based on Google Refine code (in particular, the schema alignment code).

Update: 11/05/2011 The SPARQL queries used have been rewritten, resulting in a significant performance improvement. Try it yourself.
Update: 13/09/2011 A "hide literals" option has been added. Try it yourself.