eTech05: RDF in the Wild

I was going to attend Zawodny’s Yahoo talk, but as it’s mainly an API overview, I figured I’d get more out of this eTech session.

The Creative Commons license is a fairly new license that many web sites (including this one) use to indicate acceptable re-use of their content. The CC license contains a good bit of embedded RDF and this now allows, some 18 months later, for some interesting analysis of RDF in the wild.

In particular, an RDF-based Creative Commons search engine is now available.

So why did they choose the semantic web, that is, RDF, to encode metadata in the license? There was no easy way to track licenses, since they have no staff, and search engines wouldn’t necessarily pick up metadata written in plain text. So it was decided that an experiment using RDF, with thousands of licenses in the wild, would be a good thing.

The slides give a good overview of why RDF in HTML was chosen over a variety of other techniques.

One of their goals was that the license and metadata be generated by the server issuing the license, and also be able to be simply pasted into a web page.
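
As a rough illustration of that goal, here is a minimal Python sketch of a server-side helper that emits a paste-ready snippet: a visible license link plus the license RDF tucked into an HTML comment. The function name `license_snippet` is made up for this example, and the namespace and element names are my recollection of the CC metadata format, so treat them as assumptions rather than the exact markup the CC license generator produces.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumed namespaces: the CC metadata vocabulary and the standard RDF namespace.
CC_NS = "http://web.resource.org/cc/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def license_snippet(license_url, link_text="Creative Commons License"):
    """Return a visible link plus license RDF hidden in an HTML comment.

    Hypothetical helper for illustration only.
    """
    url_attr = quoteattr(license_url)  # escapes the URL and adds the quotes
    return "\n".join([
        f'<a rel="license" href={url_attr}>{escape(link_text)}</a>',
        "<!--",
        f'<rdf:RDF xmlns="{CC_NS}" xmlns:rdf="{RDF_NS}">',
        '  <Work rdf:about="">',
        f"    <license rdf:resource={url_attr} />",
        "  </Work>",
        "</rdf:RDF>",
        "-->",
    ])


snippet = license_snippet("http://creativecommons.org/licenses/by/2.0/")
print(snippet)
```

Because the RDF sits inside a comment, browsers ignore it, while a crawler that knows to look for it can still read the license.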

Next was a demo of the Creative Commons search engine; try it for yourself!

The second part of the talk focused on the technical underpinnings of their custom search engine. For the prototype, both the crawler and the app itself were written in Python, and PostgreSQL was used as the database. The result was intolerably slow.

Lesson: search engines are hard to do, but the prototype whetted the appetite for a Creative Commons search engine.

Next they turned to the open source search engine Nutch. After they modified it to meet their needs (parsing CC RDF), indexing of Creative Commons licensed docs went forward.
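
To give a feel for what "parse CC RDF" might involve, here is a small Python sketch that pulls license URLs out of a crawled page. This is not the Nutch modification itself (Nutch is written in Java); the function name `extract_license_urls`, the regex approach, and the namespaces are all assumptions of mine for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumed namespaces for the CC vocabulary and for RDF itself.
CC_NS = "http://web.resource.org/cc/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Grab each <rdf:RDF>...</rdf:RDF> block, whether or not it sits in a comment.
RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def extract_license_urls(html):
    """Return the license URLs found in RDF blocks embedded in a page."""
    urls = []
    for block in RDF_BLOCK.findall(html):
        try:
            root = ET.fromstring(block)
        except ET.ParseError:
            continue  # skip malformed RDF rather than abort the crawl
        for el in root.iter(f"{{{CC_NS}}}license"):
            url = el.get(f"{{{RDF_NS}}}resource")
            if url:
                urls.append(url)
    return urls


page = """<html><body>
<!--
<rdf:RDF xmlns="http://web.resource.org/cc/"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <Work rdf:about="">
    <license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
  </Work>
</rdf:RDF>
-->
</body></html>"""

print(extract_license_urls(page))
# prints ['http://creativecommons.org/licenses/by/2.0/']
```

An indexer could then store the license URL as a field alongside each document, which is what makes license-aware searches possible.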

Next came some examples of interesting CC searches that can be performed; use your imagination!

The talk concluded with some possible future directions that CC metadata may take, including semantic XHTML.

I’ve long been skeptical of the semantic web, but I think this is a great example of an application where it makes sense to use RDF to support a semantic sub-web.