Exhibit Extension for Open Refine

This is a quick tutorial on how to use the Exhibit extension of Open Refine (download here). To learn how to install Open Refine extensions, check this page.

Before I start with the tutorial, here is a quick description of the extension from the README.md of the GitHub repository:

This is an Open Refine extension that allows you to export the data along with the facets as an Exhibit 3.0 web page. This can be seen as taking a snapshot of your Open Refine project. You can then save it, publish it online, or share it with others without requiring Open Refine.

The exported Exhibit will contain:

  • List and numeric facets added at the moment of exporting (even facets that use more complex GREL expressions).
  • Any filters applied to the data will also apply to the exported data.

I think it is cool that this extension brings two of David Huynh's creations together.

Now, on to the tutorial… I am using a dataset of used car prices containing 845 cars.

[Screenshot: the used cars dataset in Open Refine]

I added two facets to allow filtering by the cars' makes and models. I also added a numeric facet to filter by year. Finally, to allow filtering by engine type (petrol or diesel) regardless of the capacity, I added a custom facet as shown below:

[Screenshot: the custom GREL facet]
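
For reference, such a custom facet can be built with a GREL expression along these lines (a sketch, not my exact expression; it assumes the engine description lives in a single text column with values like “1.6 Petrol” or “2.0 Diesel”):

  if(value.contains("Diesel"), "Diesel", "Petrol")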

As an example, I filtered the cars down to four specific models, resulting in 313 matching cars.

[Screenshot: the dataset filtered down to four models]

I have the Exhibit extension installed, which can be verified by looking at the export menu (the last option is Exhibit):

[Screenshot: the export menu with Exhibit as the last option]

If I click on Exhibit I get a zipped archive containing three files: index.html, data.json, and style.css. Opening the index.html using Firefox (note: Exhibit doesn't work locally on Chrome) I get the following:

[Screenshot: the exported Exhibit page opened in Firefox]

This can be seen as a snapshot of the Refine data that is easy to share and publish. Notice that it contains only the filtered rows and that it includes both text and numeric facets. GREL expressions in facets are also supported.
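
As a side note on the export format: data.json follows Exhibit's JSON convention, with one object per row. It looks roughly like this (a hand-written sketch with made-up values, not the actual export):

  {
    "items": [
      { "type": "Car", "label": "Golf",  "make": "Volkswagen", "year": 2005 },
      { "type": "Car", "label": "Focus", "make": "Ford",       "year": 2007 }
    ]
  }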

I hope people find this useful. Report any issues or feature requests on GitHub.


Sending non-standard SPARQL queries with Apache Jena

Some non-standard extensions of SPARQL result in queries that are not syntactically compliant with the SPARQL specification. An example of such an extension is the Virtuoso full-text search, which uses a “magic” prefix bif without defining it in the query.

When using a non-compliant extension in Apache Jena, a call to QueryExecutionFactory.sparqlService will fail. A workaround that worked for me is to use the QueryEngineHTTP class, as in:

  // Bypass Jena's query parser by sending the query string directly over HTTP
  QueryEngineHTTP qExec = new QueryEngineHTTP(sparqlEndpointUrl, sparql);
  ResultSet res = qExec.execSelect();
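
To make this concrete, here is a minimal self-contained sketch of querying Virtuoso's full-text index this way (the endpoint URL and query are illustrative; bif:contains is the Virtuoso-specific predicate whose undeclared prefix trips up Jena's parser):

  import com.hp.hpl.jena.query.ResultSet;
  import com.hp.hpl.jena.query.ResultSetFormatter;
  import com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP;

  public class FullTextQueryExample {
      public static void main(String[] args) {
          // The query is sent as a raw string, so Jena never tries to parse
          // the Virtuoso-specific bif:contains predicate.
          String sparql =
              "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
            + "SELECT ?s ?label WHERE { "
            + "  ?s rdfs:label ?label . "
            + "  ?label bif:contains \"linked\" . "
            + "} LIMIT 10";
          QueryEngineHTTP qExec = new QueryEngineHTTP("http://dbpedia.org/sparql", sparql);
          ResultSet res = qExec.execSelect();
          ResultSetFormatter.out(System.out, res);
          qExec.close();
      }
  }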

Apache Jena on Google App Engine

Jena uses Apache HttpClient to query remote endpoints, but HttpClient cannot be used on Google App Engine. Here is a quick workaround, based on plain HttpURLConnection, that worked fine for me:


  // Required imports
  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;
  import java.net.URLEncoder;

  import com.hp.hpl.jena.query.ResultSet;
  import com.hp.hpl.jena.query.ResultSetFactory;

  private ResultSet execSelect(String sparql, String endpoint) throws Exception {
      // java.net.HttpURLConnection is backed by App Engine's URL Fetch service,
      // unlike HttpClient, which App Engine does not support.
      HttpURLConnection connection = (HttpURLConnection) new URL(
              endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8")).openConnection();
      // Ask the endpoint for SPARQL results in XML so Jena can parse them.
      connection.setRequestProperty("Accept", "application/sparql-results+xml");
      InputStream response = connection.getInputStream();
      return ResultSetFactory.fromXML(response);
  }
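
A quick usage sketch (the query and endpoint are illustrative):

  ResultSet res = execSelect("SELECT * WHERE { ?s ?p ?o } LIMIT 10",
                             "http://dbpedia.org/sparql");
  while (res.hasNext()) {
      System.out.println(res.next());
  }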



A Faceted Browser over SPARQL Endpoints

I have recently been working on (yet another?) faceted browser over RDF data… more precisely, RDF data loaded in a SPARQL endpoint that supports COUNT and GROUP BY queries. I have successfully used it against Fuseki, the Talis platform (tested against http://api.talis.com/stores/bbc-backstage/services/sparql) and Virtuoso (tested against http://dbpedia.org/sparql).

Main characteristics:

  1. Configurable: most aspects of the browser are configurable through two JSON files (configuration.json and facets.json). This includes basic templating & styling ability. To change the style, add a facet, or browse completely different data, just edit the JSON files accordingly and reload the page
  2. No preprocessing required: as all requests are standard SPARQL queries… nothing is required a priori on either the publisher or the consumer end
  3. Facets are defined as triple patterns (see example), so facet values don't need to be directly associated with the browsed items, i.e. they can be a few hops away from them (see the sketch after this list)
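
To make point 3 concrete: a facet defined over such a triple pattern boils down to a standard aggregate query against the endpoint, roughly of this shape (the predicates are illustrative and prefix declarations are omitted):

  SELECT ?value (COUNT(DISTINCT ?item) AS ?count)
  WHERE {
    ?item a ex:Car .                 # the set of browsed items
    ?item ex:manufacturer ?maker .   # facet values can sit a few hops...
    ?maker rdfs:label ?value .       # ...away from the browsed items
  }
  GROUP BY ?value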

See the screenshot below to get a feel for the browser…

If you have used Google Refine before, the resemblance is probably clear. Indeed, I am reusing the JavaScript and CSS code behind the facets part of Google Refine (gratefully available under the New BSD License… how much I love open source!!!)

Getting it running

The code is shared on GitHub. Grab it, deploy it to a Java application server, and then play with the configuration files (a number of examples are provided with the source).

Outlook and //TODO

The most exciting part to me (thanks to the Google Refine inspiration) is that all the knowledge needed about the endpoint and the facets is maintained as JSON on the client side and communicated to the server upon each request. If a user somehow updates this configuration and submits it to the server, she can customise her view. As an example, I added a change button to each facet which allows the user to change only the facet she sees… potentially, an add facet button could be added as well.

Additionally, a list of issues and features to add is on the GitHub repository.

Comments/feedback are very warmly welcomed 🙂


Update 19/03/2012: support for full-text search facets added. Currently supports Virtuoso and standard SPARQL (using regex). See example.


Thesis Submitted… Mission Accomplished

I submitted my thesis today… its title is “Getting to the Five-Star: From Raw Data to Linked Government Data”.

Thanks to Google Docs, it is now online at this scary link.

Now, honestly, for me it is an interesting read 🙂 but I know you guys are busy people, so if you don't want to read it all, I recommend chapters 3 and 9.

Let me know how you find it.

P.S. Also read the Acknowledgments part, because I feel awkward replicating it here.


A shot at RDF schema discovery

So you wanna know about my RDF data… sure! Here is a SPARQL endpoint URL; now you can go and find everything you want. Dude, it's RDF! You can query both the data and the schema… all you need is there. Even better, it's Linked Data and -damn yeah- all the resources have got their cool URIs… now, like a bee jumping happily between flowers, you can navigate through this information. Serendipity and follow-your-nose at their best, isn't it?

Now, that's true, but it is not as pleasant as it sounds. Understanding the data and the schema in order to be able to write meaningful queries is a tedious task that involves “tons” of SPARQL queries and imposes a significant mental overload… Following HTTP links between Linked Data resources is no less tedious.

One of the approaches to describing RDF datasets and easing navigation, discovery and querying is VoID, a vocabulary for describing RDF datasets… it includes a number of useful properties that help in getting an idea about the content of an RDF dataset and in accessing it. Example properties include void:vocabulary, void:exampleResource and void:sparqlEndpoint. However, Linked Data datasets usually use terms from different vocabularies and do not follow a particular “schema”. In other words, knowing the vocabularies doesn't give you a sufficient idea about what is in the dataset (there is no equivalent of the XML world's XSD or DTD).
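
For illustration, a minimal VoID description using these properties looks roughly like this (all URIs are hypothetical placeholders):

  @prefix void: <http://rdfs.org/ns/void#> .
  @prefix :     <http://example.org/> .

  :myDataset a void:Dataset ;
      void:sparqlEndpoint  <http://example.org/sparql> ;
      void:vocabulary      <http://xmlns.com/foaf/0.1/> ;
      void:exampleResource <http://example.org/Alice> .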

So What?

I built a simple service that generates a summary of an RDF dataset based on its VoID description. The summary is a tree rooted at a void:rootResource or void:exampleResource resource. It is available on Google App Engine. You can start from a VoID description, or by inputting a SPARQL endpoint URL and the URI of a resource to start with.

The screenshot below shows an example based on the VoID description of the “BBC backstage Dataset”:
[Screenshot: example of a schema summary generated from the BBC music data]

The tree starts at a resource (or set of resources) and includes all its neighbours… it repeats this for every node up to a specified limit (set through the depth URL parameter). On its way, it aggregates values when multiple values result from a (subject, predicate) pair or a (set of subjects, predicate) pair.

  • Hover over the tree nodes to see the actual values
  • Click on resource nodes (colored violet) to navigate
  • Bordered nodes represent multiple resources/literals

A bit of background

This is similar to the work described in Representative Objects: Concise Representations of Semistructured, Hierarchical Data. The paper, published in 1997 by Stanford researchers, formalizes the process of summarizing semistructured data based on automata theory (this can in fact be seen as an NFA-to-DFA transformation).

//TODO list

  1. Handle blank nodes (currently they are ignored)
  2. Share the code on GitHub
  3. How can this be integrated into a SPARQL query editor or a Linked Data browser?

* The JavaScript used to plot the tree is based on Google Refine code (in particular, the schema alignment code).

Update 11/05/2011: the SPARQL queries used have been rewritten, resulting in a significant performance improvement. Try it yourself.
Update 13/09/2011: a “Hide literals” option has been added. Try it yourself.


Blogging!!

I will start blogging.