Tapping into the web of data (Cosmin Basca)

Tags: europython2009, europython, rdf

Cosmin works for DERI, one of the largest semantic web research institutes in the world. (In one of the rainiest parts of the world, too, btw).

Why would you want to use the semantic web? Well, there is a lot of data and you might want to use it. You can easily aggregate multiple sources. You can evolve data sources without too many worries about migration issues: the semantic web is by nature very robust. You might not support all the new features of a new format, but your application will probably still work.

Some popular, much-used vocabularies are FOAF and DOAP. Also interesting is DBpedia, which provides information extracted from Wikipedia. And as semantic web data is linked, you can already find a lot of linked information once you have a starting point. Start with a town and you find the name of a famous local musician, which in turn links to a music database.

A starting point for the semantic web is the data format RDF. The core of RDF is a subject/predicate/object “triple”, for instance reinout/blogs_about/europython. In addition, almost all items are URLs rather than plain text: “http://reinout.vanrees.org” instead of “reinout”. The big advantage: a URL is a strong reference. It is unique.
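To make the triple idea concrete, here is a minimal sketch in plain Python, without any RDF library. Apart from http://reinout.vanrees.org, all URLs and the `objects` helper are made up for illustration.

```python
# A triple is (subject, predicate, object); URLs make each item unambiguous.
# Every URL below except REINOUT is an invented example.org placeholder.
REINOUT = "http://reinout.vanrees.org"
BLOGS_ABOUT = "http://example.org/vocab#blogs_about"
EUROPYTHON = "http://example.org/events/europython"
PYTHON = "http://example.org/topics/python"

triples = [
    (REINOUT, BLOGS_ABOUT, EUROPYTHON),
    (REINOUT, BLOGS_ABOUT, PYTHON),
    (EUROPYTHON, "http://example.org/vocab#is_about", PYTHON),
]


def objects(triples, subject, predicate):
    """Return every object linked from `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


topics = objects(triples, REINOUT, BLOGS_ABOUT)
print(topics)
```

Because subjects and objects are both URLs, the object of one triple can be the subject of another, which is what makes the data "linked".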

SPARQL is a query language for RDF data. But just as you have an ORM (object-relational mapper) for mapping pure SQL queries onto objects, you can have an O-RDF mapper (object-RDF mapper). This is handled by their SuRF tool. How do we see RDF data: as a set of triples or as a set of resources? Resources map much more naturally to objects. SuRF is inspired by ActiveRDF, which was developed at DERI for the Ruby language.

An RDF resource is defined as all triples (subject/predicate/object) with the same subject. So it only looks at the “outgoing” predicates/relations. If something else says something about you, it is not automatically included. The predicates are accessed as attributes, so instance.namespace_attribute. It uses lazy loading. So cosmin.foaf_knows returns the friends that Cosmin knows. There’s also a convention for looking up the “reverse” properties (so what other instances are saying about us). Using a dynamic language like Python really helps here.
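The attribute mapping can be sketched in a few lines of plain Python. This is not SuRF's code, just an illustration of the idea: the `Resource` class, the in-memory triple list, the example.org URLs and the `is_..._of` naming for reverse lookups are all assumptions made for this sketch.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
NAMESPACES = {"foaf": FOAF}

# Hypothetical in-memory data; the example.org URLs are invented.
TRIPLES = [
    ("http://example.org/cosmin", FOAF + "knows", "http://example.org/alice"),
    ("http://example.org/cosmin", FOAF + "knows", "http://example.org/bob"),
    ("http://example.org/alice", FOAF + "knows", "http://example.org/cosmin"),
]


class Resource:
    """All triples sharing one subject, with predicates exposed as attributes."""

    def __init__(self, subject, triples):
        self.subject = subject
        self._triples = triples
        self._cache = {}  # filled on first access (lazy loading)

    def __getattr__(self, name):
        # Only called for names not found normally, e.g. "foaf_knows".
        if name.startswith("_"):
            raise AttributeError(name)
        # Assumed convention for this sketch: is_<prefix>_<name>_of is a reverse lookup.
        reverse = name.startswith("is_") and name.endswith("_of")
        key = name[3:-3] if reverse else name
        prefix, _, local = key.partition("_")
        if prefix not in NAMESPACES or not local:
            raise AttributeError(name)
        if name not in self._cache:
            predicate = NAMESPACES[prefix] + local
            if reverse:
                # Incoming relations: who points at us with this predicate?
                values = [s for s, p, o in self._triples
                          if p == predicate and o == self.subject]
            else:
                # Outgoing relations: what do we point at with this predicate?
                values = [o for s, p, o in self._triples
                          if s == self.subject and p == predicate]
            self._cache[name] = values
        return self._cache[name]


cosmin = Resource("http://example.org/cosmin", TRIPLES)
friends = cosmin.foaf_knows
known_by = cosmin.is_foaf_knows_of
print(friends, known_by)
```

Python's `__getattr__` hook is what makes this cheap to write: no attribute has to be declared up front, which is the kind of help from a dynamic language the talk refers to.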

SuRF has session handling which means that if you modify data, the changes aren’t written back to the various aggregated data stores until you call a commit(). For data sources, you can write a plugin for SuRF and provide an RDFReader and RDFWriter class.
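The commit behaviour is the familiar unit-of-work pattern. A rough sketch of the idea in plain Python follows; the `Session` class, its `add` method and the list used as a store are hypothetical stand-ins, not SuRF's actual API.

```python
class Session:
    """Buffers added triples until commit() pushes them to the store."""

    def __init__(self, store):
        self.store = store  # any list-like backing store (assumption)
        self._pending = []

    def add(self, subject, predicate, obj):
        # Nothing reaches the store yet; the change is only remembered.
        self._pending.append((subject, predicate, obj))

    def commit(self):
        """Write all pending triples to the store; return how many were written."""
        written = len(self._pending)
        self.store.extend(self._pending)
        self._pending.clear()
        return written


store = []
session = Session(store)
session.add("http://example.org/cosmin", "http://xmlns.com/foaf/0.1/name", "Cosmin")
before_commit = list(store)  # snapshot: still empty at this point
count = session.commit()
print(before_commit, count, store)
```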

The code is open source (BSD) and available on PyPI. It is easy to get started with a simple read-only instance.

If you want to integrate SuRF with a web framework, try to pick a framework like Pylons (and probably TurboGears) that doesn’t have a lot of home-grown components. In Pylons it is easy to plug in a different data source (so: SuRF).


About me

My name is Reinout van Rees and I work a lot with Python (programming language) and Django (website framework). I live in The Netherlands and I'm happily married to Annie van Rees-Kooiman.
