Exhibit Extension for Open Refine

This is a quick tutorial on how to use the Exhibit extension of Open Refine (download here). To learn how to install Open Refine extensions, check this page.

Before I start with the tutorial, here is a quick description of the extension from the README.md of the GitHub repository:

This is an Open Refine extension that allows you to export the data along with the facets as an Exhibit 3.0 web page. This can be seen as taking a snapshot of your Open Refine project. You can then save it, publish it online, or share it with others without requiring Open Refine.

The exported Exhibit will contain:

  • List and numeric facets added at the moment of exporting (even facets that use more complex GREL expressions).
  • Filters applied to the data will be applied to the exported data.

I think it is cool that this extension brings two of David Huynh’s creations together.

Now, on to the tutorial… I am using a dataset of used car prices containing 845 cars.


I added two facets to allow filtering by car make and model. I also added a numeric facet to filter by year. Finally, to allow filtering by engine type (petrol or diesel) regardless of the capacity, I added a custom facet as shown below:


As an example, I filtered the cars to four specific models, resulting in 313 matching cars.


I have the Exhibit extension installed, which can be verified by looking at the export menu (last option is Exhibit):


If I click on Exhibit I get a zipped archive containing three files: index.html, data.json, and style.css. Opening index.html in Firefox (note: Exhibit doesn’t work in Chrome when loaded locally), I get the following:


This can be seen as a snapshot of the Refine data that is easy to share and publish. Notice that it contains only the filtered rows and that it contains both text and numeric facets. Facets based on GREL expressions work as well.

I hope people find this useful. Please report any issues or feature requests on GitHub.

Sending non-standard SPARQL queries with Apache Jena

Some non-standard extensions of SPARQL result in queries that are not syntactically compliant with the SPARQL specification. An example of such an extension is the Virtuoso full text search, which uses a “magic” prefix bif without defining it in the query.

When using a non-compliant extension in Apache Jena, a call to QueryExecutionFactory.sparqlService will fail, because Jena parses the query locally before sending it. A workaround that worked for me is to use the QueryEngineHTTP class, which takes the query as a plain string, as in:

  QueryEngineHTTP qExec = new QueryEngineHTTP(sparqlEndpointUrl, sparql);
  ResultSet res = qExec.execSelect();
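
To see why a standards-compliant parser rejects such a query, here is a rough sketch that lists the prefixes a query uses without declaring them. It is only an illustration built on regular expressions and is not how Jena parses SPARQL; the class and method names are made up for this example, and single-quoted literals are not handled.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jena: a regex-based approximation.
class UndeclaredPrefixes {
    // Matches "PREFIX name:" declarations (the default prefix ":" has no name).
    private static final Pattern DECLARED =
        Pattern.compile("(?i)\\bPREFIX\\s+([A-Za-z][\\w-]*)?:");
    // Matches the prefix part of prefixed names such as bif:contains.
    private static final Pattern USED =
        Pattern.compile("(?<![\\w:])([A-Za-z][\\w-]*):[A-Za-z_]");

    static Set<String> find(String sparql) {
        // Drop IRIs and double-quoted literals so their colons are not mistaken for prefixes.
        String stripped = sparql
            .replaceAll("<[^<>\\s]*>", " ")
            .replaceAll("\"(\\\\.|[^\"\\\\])*\"", " ");

        Set<String> declared = new LinkedHashSet<>();
        Matcher d = DECLARED.matcher(stripped);
        while (d.find()) {
            if (d.group(1) != null) {
                declared.add(d.group(1));
            }
        }

        Set<String> undeclared = new LinkedHashSet<>();
        Matcher u = USED.matcher(stripped);
        while (u.find()) {
            if (!declared.contains(u.group(1))) {
                undeclared.add(u.group(1));
            }
        }
        return undeclared;
    }

    public static void main(String[] args) {
        String query =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
            + "SELECT ?s WHERE { ?s rdfs:label ?label . ?label bif:contains \"'london'\" }";
        System.out.println(find(query));
    }
}
```

For the query in main, rdfs is declared while bif is not, which is exactly what trips up a strict parser.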

Apache Jena on Google App Engine

Jena uses HttpClient to query remote endpoints. When deploying on Google App Engine, you cannot use HttpClient. Here is a quick workaround that worked fine for me:

  private ResultSet execSelect(String sparql, String endpoint) throws Exception {
    HttpURLConnection connection = (HttpURLConnection) new URL(
        endpoint + "?query=" + URLEncoder.encode(sparql, "UTF-8")).openConnection();
    connection.setRequestProperty("Accept", "application/sparql-results+xml");
    InputStream response = connection.getInputStream();
    return ResultSetFactory.fromXML(response);
  }
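
The URL-building step of this workaround can be pulled out and checked on its own, without Jena or a network connection. The sketch below is a small variation, not the code I deployed: the class name is hypothetical, it assumes Java 10 or later for the Charset overload of URLEncoder.encode, and it also copes with endpoint URLs that already carry a query string.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds the GET URL for a SPARQL protocol request.
class SparqlUrl {
    static String build(String endpoint, String sparql) {
        // Use '&' when the endpoint URL already has a query string.
        String separator = endpoint.contains("?") ? "&" : "?";
        return endpoint + separator + "query="
            + URLEncoder.encode(sparql, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(
            build("http://dbpedia.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 1"));
    }
}
```

The resulting string can be passed to new URL(...) exactly as in execSelect above.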


Data marketplace… is it too big a thing to be tackled as a whole?

This is my second post on data marketplaces… unfortunately triggered by the bad news of Talis winding Kasabi down. There are a number of good posts discussing this and what it means for the Semantic Web and Linked Data efforts. I’d like to share my ideas here, focusing on the data marketplace side of the story.

In his blog post, Tim Hodson wrote:

So we were too early. We had a vision for easy data flow into and out of organisations, where everyone can find what they need in the form that they need it through the use of linked data and APIs, and where those data streams could be monetized and data layers could add value to your datasets

The previous quote aptly captures the essential aspects of data marketplaces. In its richest form, a data marketplace enables buying/selling access to quality data provided by different publishers (essential aspects are in bold).

Tim went on to say:

Other organisations besides Talis, sharing similar visions, have all had to change the way they present themselves as they realise that the market is simply not ready for something so new.

So I looked at a number of existing data marketplaces to see how they present themselves. It is hard to pin down exactly what a data marketplace is; I am including these mainly based on Paul Miller’s podcasts:

  • AggData.com: sells lists crawled from the Web as downloadable files.
  • Datafiniti: sells data crawled from the Web through a SQL-like interface.
  • Microsoft Azure Data Marketplace: sells data from a number of publishers via API access based on OData.
  • Infochimps: sells data from a number of publishers via a mix of downloads and API access.
  • datamarket.com: sells only numeric data provided by a number of publishers. It focuses mainly on visualization but also provides API access.
  • Factual: collects data (mainly related to locations) and sells API access to the data.
  • Kasabi: sells API access to data from different publishers.

From the list above, datamarket.com, Azure, Infochimps and Kasabi fit the more specific definition of a data marketplace, i.e. they provide API access to data provided by different publishers. These functionalities have implications:

  1. Supporting different publishers calls for a managed hosted service (a place for any publisher to put its data).
  2. API Access calls for cleansing and modeling any included data.

Selling simple access to collected data (e.g. downloadable crawled lists) doesn’t involve either of the two challenges above (or involves a simpler version of them). Providing data hosting services (i.e. database-as-a-service) doesn’t necessarily involve data cleansing and modeling (as these only affect the owner of the data, who is mostly its only user). Both domains, collect-and-sell-data and database-as-a-service, seem to be doing fine and enjoying a good market. On the other hand, if we look at data marketplaces, it is clear that they don’t present themselves as pure data marketplaces (not anymore at least):

datamarket.com ==> sells the platform as well, specialises in numbers and focuses on visualization.

Infochimps ==> calls itself “Big Data Platform for the Cloud”

Azure Data Marketplace ==> is still a pure marketplace but as part of the Microsoft Azure Cloud Platform.

All this makes me wonder: is a data marketplace too big a thing to be tackled now? Is the market not ready? Are the technology and tools not ready? Are marketplaces not selling themselves well? Should we give up the idea of having a marketplace for data?

I am just having a hard time trying to understand…

P.S. All the best to the great Kasabi team… I learned a lot from you!

Promises of Data Marketplaces and How Can We Evaluate Them?

One of the questions I was interested in while listening to the excellent series of podcasts by Paul Miller on data marketplaces was: why would people pay to access data? This can be put differently as: what value do data marketplaces offer?

Here is a compiled list of benefits that data marketplaces promise:

  • Discoverability: through a central place where datasets are described and can be found.
  • Easy access to the data: via providing API access to the data for example.
  • Easy publishing: off-the-shelf infrastructure.
  • Commercialisation: easy buying and selling of data.
  • Better data quality: providing curated and maintained datasets.
  • Value-added data: having all the datasets in one place enables users (or the marketplace provider) to draw new insights, remix datasets and derive new ones.

A logically following question is: how can we evaluate the extent to which data marketplaces are fulfilling their promises? With the growing belief that data should be made available for free, it is important for data marketplaces to make clear the additional value they offer. Ironically maybe, this can prove to be very helpful to the open data movement, as the quality complaints that usually accompany open data can be addressed by marketplaces at a non-prohibitive cost on the consumer side… I believe that an empirical study of the existing data marketplaces can reveal interesting insights and lessons. I don’t have a clear idea of how to evaluate the impact that data marketplaces have achieved regarding their potential benefits, but here are a few sketchy ideas…

  • Discoverability: do data markets enhance metadata description of datasets? provide an API to search for datasets? standardise metadata description? etc…
  • Easy access to the data: this boils down to evaluating the access method (mostly an API) provided, along with service quality metrics such as availability, performance, etc… An interesting idea I came across in this paper (PDF) is that the prevailing charge-per-transaction model hinders ease of access, as clients might have to cache results. Data licensing is also related to the ease of access, and data marketplaces have the potential to foster convergence on a small, but sufficient, set of data licenses.
  • Easy publishing: evaluating the set of services the data market provides for publishers
  • Commercialisation: what percentage of datasets on a marketplace is not free? are there datasets available for sale on a marketplace but not anywhere else (i.e. their commercialisation is solely enabled by the existence of the marketplace)?
  • Better data quality: did marketplaces enhance the quality of (open) data available elsewhere?
  • Value-added data: can users meaningfully remix existing datasets? is there a market-wide query engine? are there new datasets provided by the data marketplace through drawing insights from or remixing a number of existing datasets?

One of the biggest challenges here is that the term “data marketplace” is still used in a very loose manner, which risks ending up comparing apples with oranges… However, a carefully designed comparison can prove vital in advancing the current state of the art. I’d be very glad to hear your ideas and feedback on this.

A Faceted Browser over SPARQL Endpoints

I have recently been working on (yet another?) faceted browser over RDF data… more precisely, RDF data loaded in a SPARQL endpoint that supports COUNT and GROUP BY queries. I have successfully used it against Fuseki, the Talis platform (tested against http://api.talis.com/stores/bbc-backstage/services/sparql) and Virtuoso (tested against http://dbpedia.org/sparql).

Main characteristics:

  1. Configurable: most aspects of the browser are configurable through two JSON files (configuration.json and facets.json). This includes basic templating and styling ability. To change the style, add a facet or browse completely different data, just edit the JSON files accordingly and reload the page.
  2. No preprocessing required: as all requests are standard SPARQL queries, nothing is required a priori on either the publisher or the consumer end.
  3. Facets are defined as triple patterns (see example), so facet values don’t need to be directly associated with the browsed items, i.e. they can be a few hops away from them.
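
To make the third point more concrete, here is a minimal sketch of how a facet defined as a triple pattern could be turned into a COUNT and GROUP BY query. This is not the browser's actual code: the class name, the ?item, ?value and ?count variable names and the SPARQL 1.1 aggregate syntax are all assumptions made for illustration.

```java
// Hypothetical sketch: the names and the query shape are assumptions.
class FacetQuery {
    /**
     * @param itemPattern  triple pattern selecting the browsed items, bound to ?item
     * @param facetPattern triple pattern(s) leading from ?item to the facet value ?value
     * @return a query counting the items per facet value, most frequent value first
     */
    static String countsFor(String itemPattern, String facetPattern) {
        return "SELECT ?value (COUNT(DISTINCT ?item) AS ?count)\n"
            + "WHERE {\n"
            + "  " + itemPattern + "\n"
            + "  " + facetPattern + "\n"
            + "}\n"
            + "GROUP BY ?value\n"
            + "ORDER BY DESC(?count)";
    }

    public static void main(String[] args) {
        // The facet value is two hops away from the item: item -> place -> label.
        String items = "?item a <http://xmlns.com/foaf/0.1/Person> .";
        String facet = "?item <http://xmlns.com/foaf/0.1/based_near> ?place . "
            + "?place <http://www.w3.org/2000/01/rdf-schema#label> ?value .";
        System.out.println(countsFor(items, facet));
    }
}
```

Because the facet is just a graph pattern, reaching values several hops away only means writing a longer pattern; the rest of the query stays the same.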

See the screenshot below to get a feeling for the browser…

If you have used Google Refine before, the resemblance is probably clear. Indeed, I am reusing the JavaScript and CSS code that makes up the facets part of Google Refine (they are, thankfully, under the New BSD License… how much I love open source!!!)

Having it running

The code is shared on GitHub. Grab it, deploy it to a Java application server and then play with the configuration files (a number of examples are provided with the source).

Outlook and //TODO

The most exciting part to me (thanks to the Google Refine inspiration) is that all the needed knowledge about the endpoint and the facets is maintained as JSON on the client side and communicated to the server with each request. If a user updates this configuration and submits it to the server, she can customise her view (as an example, I added a change button to each facet which allows the user to change only the facet she sees… potentially an add-facet button can be added).

Additionally, a list of issues and features to add is on the GitHub repository.

Comments and feedback are very warmly welcomed 🙂


Update 19/03/2012: support for fulltext search facets added. Currently supports Virtuoso and standard SPARQL (using regex). See example.
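
As a rough idea of what the two variants can look like, here is a sketch that builds the filter clause for each case. It is a guess at the general shape rather than the browser's actual code: the class and method names are hypothetical, I am assuming the Virtuoso variant relies on bif:contains, and stripping single quotes from the search term is a deliberately crude simplification.

```java
// Hypothetical sketch of the two full text filter flavours.
class TextFilter {
    // Escape backslashes and double quotes for use inside a SPARQL string literal.
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Standard SPARQL: case-insensitive regex; the term is interpreted as a regex pattern.
    static String standard(String variable, String term) {
        return "FILTER regex(str(?" + variable + "), \"" + escape(term) + "\", \"i\")";
    }

    // Virtuoso: non-standard bif:contains, with the phrase wrapped in single quotes.
    static String virtuoso(String variable, String term) {
        String phrase = term.replace("'", " "); // crude: drop quotes instead of escaping them
        return "?" + variable + " bif:contains \"'" + escape(phrase) + "'\" .";
    }

    public static void main(String[] args) {
        System.out.println(standard("label", "new york"));
        System.out.println(virtuoso("label", "new york"));
    }
}
```

The regex variant works on any endpoint but scans every candidate value, while the Virtuoso variant can use its text index.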

Kasabi directory matrix

Kasabi is a recent player in the data marketplace space. What distinguishes Kasabi from other marketplaces (and makes it closer to my heart) is that it is based on Linked Data. All the datasets in Kasabi are represented in RDF and provide Linked Data capabilities (with an additional set of standard and customised APIs for each dataset… more details).

A recent dataset on Kasabi is the directory of datasets on Kasabi itself. Having worked on related stuff before, especially dcat, I decided to spend this weekend playing with this dataset (not the best plan for a weekend you think hah?!).

To make a long story short, I built a (currently-not-very-helpful) visualization of the distribution of the classes in the datasets, which you can see here.

In detail:
I queried the SPARQL endpoint for a list of datasets and the classes used in each of them along with their count (the Python code I used is on GitHub; however, you need to provide your own Kasabi key and subscribe to the API).
Using Protovis I visualized the data in a matrix. Datasets are sorted alphabetically, while classes are sorted in descending order of the number of datasets they are used in. Clicking on a cell currently shows the count for the corresponding (dataset, class) pair.

Note: I filtered out common classes like rdfs:Class, owl:DatatypeProperty, etc… and I also didn’t include classes that appear in only one dataset.
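
The ordering and filtering of the matrix axes can be sketched as follows. This is not the code behind the visualization (that part is JavaScript with Protovis); it is a small Java illustration with hypothetical names, assuming the query results have already been collected into a map from dataset name to per-class counts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of how the matrix axes are derived.
class ClassMatrix {
    // counts: dataset name -> (class -> number of instances in that dataset)
    static List<String> orderedClasses(Map<String, Map<String, Integer>> counts) {
        // Count in how many datasets each class appears.
        Map<String, Integer> datasetsPerClass = new HashMap<>();
        for (Map<String, Integer> classes : counts.values()) {
            for (String cls : classes.keySet()) {
                datasetsPerClass.merge(cls, 1, Integer::sum);
            }
        }
        // Keep only classes used by more than one dataset.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : datasetsPerClass.entrySet()) {
            if (e.getValue() > 1) {
                result.add(e.getKey());
            }
        }
        // Most widely used classes first; ties broken alphabetically for a stable order.
        result.sort(Comparator.comparingInt((String c) -> datasetsPerClass.get(c)).reversed()
            .thenComparing(Comparator.<String>naturalOrder()));
        return result;
    }

    // Datasets are simply sorted alphabetically.
    static List<String> orderedDatasets(Map<String, Map<String, Integer>> counts) {
        List<String> names = new ArrayList<>(counts.keySet());
        Collections.sort(names);
        return names;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        counts.put("b-data", Map.of("foaf:Person", 3, "skos:Concept", 5));
        counts.put("a-data", Map.of("skos:Concept", 2));
        counts.put("c-data", Map.of("foaf:Person", 1, "skos:Concept", 1, "ex:Rare", 7));
        System.out.println(orderedDatasets(counts));
        System.out.println(orderedClasses(counts));
    }
}
```

The alphabetical tie-break is my own addition; without it, classes used by the same number of datasets would appear in an arbitrary order.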

Quick observations:
Not surprisingly, skos:Concept and foaf:Person are the most used classes. In general, the matrix is sparse, as most of the datasets are “focused”. The Hampshire dataset, containing various information about Hampshire, uses a large number of classes.

This is still of limited value, but I have my ambitious plan below 🙂
1. set the colour hue of each cell according to the corresponding count, i.e. the number of entities of the class in the dataset
2. group (and maybe colour) datasets based on their category
3. replace class URIs with CURIEs (using prefix.cc?)
4. when clicking on a cell, show the class structure in the corresponding dataset, i.e. what properties are used to describe instances of the class in the corresponding dataset (the problem here is that I need to subscribe to the dataset to query it). This can be a good example of the smooth transition in RDF from schema to instance data
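
For the third item, shortening a class URI to a CURIE is mostly a prefix lookup. Here is a minimal sketch with a small hard-coded table; the class name is hypothetical, and a fuller table would have to come from somewhere like prefix.cc, which this sketch does not call.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: shortens URIs using a small, hard-coded namespace table.
class Curies {
    // namespace URI -> prefix
    private static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        PREFIXES.put("http://www.w3.org/2004/02/skos/core#", "skos");
        PREFIXES.put("http://xmlns.com/foaf/0.1/", "foaf");
        PREFIXES.put("http://www.w3.org/2000/01/rdf-schema#", "rdfs");
        PREFIXES.put("http://www.w3.org/2002/07/owl#", "owl");
    }

    // Returns prefix:localName for a known namespace, or the URI unchanged otherwise.
    static String shorten(String uri) {
        for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
            String namespace = e.getKey();
            if (uri.startsWith(namespace) && uri.length() > namespace.length()) {
                return e.getValue() + ":" + uri.substring(namespace.length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        System.out.println(shorten("http://xmlns.com/foaf/0.1/Person"));
        System.out.println(shorten("http://www.w3.org/2004/02/skos/core#Concept"));
    }
}
```

URIs from unknown namespaces are returned unchanged, so the matrix labels would degrade gracefully rather than break.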