At the Cambridge IVOA meeting in May 2003, I made a noise about RDF on a couple of occasions, and it was suggested that I write down something about it. So here it is.
This is not really a beginners' guide to RDF, because (a) I'm not expert enough to claim I should write such a thing, and (b) several such guides already exist, and I point to them below. My aim here is simply to draw attention to RDF in the astronomical context, give a rapid introduction to it, and point out why and how it might be useful. I'd be delighted to receive corrections and amplifications on what I write here.
$Revision: 1.2 $
Example first, syntax later. Consider the set of RDF statements illustrated here. This appears to be a very elaborate way of saying something simple, but we'll get to the point of that in a moment.
What this says is that the resource with URI
#m31 has a property called `luminosity', the
value of which is an anonymous resource (meaning simply
that it doesn't itself have a URI). That resource has two
properties, one with property name `numericValue' which has a
literal value, and one with property name `type', the
value of which is another resource, with URI
#Vmag,
and that resource in turn has the `subClassOf' property with value
#Mag.
That introduces essentially all of the RDF terminology and ideas:
The property names don't look like URIs, but are more formally
#luminosity,
http://www.w3.org/1999/02/22-rdf-syntax-ns#type and
http://www.w3.org/2000/01/rdf-schema#subClassOf,
indicating that the `luminosity' property is a local one, `type'
is part of the basic RDF syntax, and `subClassOf' is part of a
particular RDF Schema, namely the core one described as part of
the RDF spec.
As noted here, the property `subClassOf' is part of a particular
`RDF Schema'. A Schema in this context is simply a particular set
of properties, with the semantics of the various properties
carefully described in text, and perhaps with interrelations
enough that they can support some inferencing. This, as far as
I'm aware, is all that the term ontology actually refers
to. In this example, I imported the `type' and `subClassOf'
properties from their respective ontologies, and the properties
`luminosity' and `numericValue' come from a home-made ontology.
The `inferencing' here is not particularly profound; it simply
means that a general RDF processor (as long as it `knew about' the core
Schema which includes `subClassOf') confronted with this set
of RDF statements, could infer that, as well as having a
#Vmag of `8', the resource
#m31 can be
taken to have a
#Mag of `8', also. (There's
obviously a great deal of enjoyable argument extractable from the
phrases `knew about', `infer' and `taken to have', but the
useful work done by these relations is clear). DAML+OIL and OWL
are other examples of ontologies.
That's more-or-less it.
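To make the inference step concrete, here is a minimal sketch in Python. It is not how any particular RDF processor works: it assumes the four statements of the example are held as plain (subject, property, value) tuples, makes no distinction between URIs and literals, and the function name `infer_types` is my own invention. It applies just one rule, namely that if x has type C, and C is a subClassOf D, then x can be taken to have type D.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# The four statements of the example, as (subject, property, value) tuples.
# URIs and literals are both plain strings in this sketch.
triples = {
    ("#m31", "#luminosity", "#anon0"),
    ("#anon0", "#numericValue", "8"),
    ("#anon0", RDF_TYPE, "#Vmag"),
    ("#Vmag", RDFS_SUBCLASSOF, "#Mag"),
}


def infer_types(statements):
    """Apply (x type C) and (C subClassOf D) => (x type D) until nothing new appears."""
    inferred = set(statements)
    while True:
        new = {
            (x, RDF_TYPE, d)
            for (x, p, c) in inferred if p == RDF_TYPE
            for (c2, q, d) in inferred if q == RDFS_SUBCLASSOF and c2 == c
        } - inferred
        if not new:
            return inferred
        inferred |= new


closure = infer_types(triples)

# The anonymous luminosity resource is now also known to be a #Mag.
print(("#anon0", RDF_TYPE, "#Mag") in closure)
```

The processor needs to `know about' nothing beyond the two URIs at the top; everything else comes from the statements themselves.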
The introduction to the RDF Recommendation describes the goals of RDF, and one passage in particular says
[RDF] provides interoperability between applications that exchange machine-understandable information on the Web. RDF emphasizes facilities to enable automated processing of Web resources. RDF can be used in a variety of application areas; for example: in resource discovery to provide better search engine capabilities, in cataloging for describing the content and content relationships available at a particular Web site, page, or digital library, by intelligent software agents to facilitate knowledge sharing and exchange, in content rating, in describing collections of pages that represent a single logical "document" [...]
That's the sort of language, with `RDF' changed to `the VO', I
wouldn't be surprised to find in a VO funding application. Add
in language like
"Due to RDF's incremental extensibility,
agents processing metadata will be able to trace the origins
of schemata they are unfamiliar with back to known schemata
and perform meaningful actions on metadata they weren't
originally designed to process", and it seems fairly clear
that the problem that RDF aims to solve overlaps significantly
with the opportunities the VO wants to exploit.
It's unfortunate that, with the exception of this homily at the beginning, the RDF spec, and even the W3C RDF primer, talk a lot about syntax, but make rather a poor job of explaining just what RDF is for.
An excellent introduction, which does discuss this, is Tim
Bray's xml.com article. That article remarks that
"[RDF] is a framework for describing and interchanging
metadata", which sums it up admirably.
The Point, it seems to me, is that the sort of primitive
statements we want to make --
M31 has luminosity x,
V-magnitude is a subclass of Magnitude -- are, with only a little
processing, better modelled by RDF triples than by XML trees, so
that there is a better impedance match between what we want to
say and RDF's way of saying it.
Further, the framework is expressly designed for interoperability, both by making the primitive statements storable in a wide variety of formats, and by confronting head-on the problem of associating meaning with elements in the (RDF) Schema.
Another part of the point is that the very basic ideas which RDF exposes are very illuminating, and provide (in my opinion) a good intellectual toolbox for discussing other data-modelling problems, including those which don't explicitly involve RDF.
RDF gets some of its power from being composed of very primitive ideas, namely the resource-property-value trichotomy, and (what is much the same thing) the notion of subject-verb-object triples as the primitive type of statement which, with only minimal ingenuity, is sufficient to express everything you want. What this gets you is a system which is not useful in isolation, since the set of properties defined in the RDF spec and the core RDF Schema is tiny, but which provides the language which efforts such as OWL or DAML+OIL can use, so that they can produce useful property sets.
Because it's so simple, it's tremendously interoperable. RDF triples can be stored in XML (in a bewildering variety of syntaxes), in databases, on blackboards, even in Fortran, probably.
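As an illustration of how little a store has to know, here is a rough sketch which puts the example's statements in a relational database, using Python's standard sqlite3 module. The table layout and column names are assumptions of mine, not anything prescribed by RDF; the point is only that three columns are enough.

```python
import sqlite3

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

# An in-memory database with a single three-column table (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("#m31", "#luminosity", "#anon0"),
        ("#anon0", "#numericValue", "8"),
        ("#anon0", RDF_TYPE, "#Vmag"),
        ("#Vmag", RDFS_SUBCLASSOF, "#Mag"),
    ],
)

# Everything that is said about the anonymous luminosity resource.
rows = conn.execute(
    "SELECT property, value FROM triples WHERE subject = ? ORDER BY property",
    ("#anon0",),
).fetchall()
conn.close()

for prop, value in rows:
    print(prop, value)
```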
The other thing you gain from the simplicity is an antidote to the dominance of XML. It's easy to be sucked into the `all the world is XML' mindset, and find yourself ramming into a tree form all sorts of structures which aren't really organised like that, in much the same way that old Fortran programmers can be observed writing Fortran in any language you teach them.
Metadata lets you process resources in a generic fashion --
that's the point of it. Since in RDF properties are also
resources, it follows that it's possible to process them
generically also. That processing has the potential to be
tremendously sophisticated (and this is where the AI and
knowledge-engineering folk come in), but it can also, usefully, be as
simple as in the example above: if I want a resource which is of
type #Mag, I can deduce that something of type
#Vmag will do.
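A sketch of that sort of generic lookup, again in Python and again only under assumptions: statements are plain tuples, the function names are mine, and the `#Quantity' class is a hypothetical addition which is not in the example, included only to show that the chain of `subClassOf' statements can be followed for more than one step.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"

triples = {
    ("#m31", "#luminosity", "#anon0"),
    ("#anon0", "#numericValue", "8"),
    ("#anon0", RDF_TYPE, "#Vmag"),
    ("#Vmag", RDFS_SUBCLASSOF, "#Mag"),
    ("#Mag", RDFS_SUBCLASSOF, "#Quantity"),  # hypothetical, not in the example
}


def superclasses(cls, statements):
    """Return cls together with every class reachable from it via subClassOf."""
    seen = {cls}
    todo = [cls]
    while todo:
        current = todo.pop()
        for (s, p, o) in statements:
            if s == current and p == RDFS_SUBCLASSOF and o not in seen:
                seen.add(o)
                todo.append(o)
    return seen


def resources_of_type(wanted, statements):
    """Find resources whose declared type is wanted, or a subclass of it."""
    return {
        s
        for (s, p, o) in statements
        if p == RDF_TYPE and wanted in superclasses(o, statements)
    }


print(resources_of_type("#Mag", triples))
```

Nothing in these functions mentions magnitudes; the same code would serve for any properties and classes described in the same way.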
The resources which RDF statements describe are all named using URIs. That's possible because URIs are intended to be a perfectly general syntax for naming things, even things which aren't, or could never be, on the web, such as people or concepts. They are therefore distinguished from URLs, which are the subset of URIs which are also addresses, and which refer to things which are actually network-retrievable; see RFC 2396 for distinctions and scope (this is currently, as of June 2003, being revised; the draft replacement rfc2396bis expires 2003 December 5).
The example at the top consists of a description of a resource
#m31 -- a local URI. While this is
natural, there is nothing to stop you creating a set of RDF
statements about any URI. Thus, with no added syntax, we have a
way of creating stand-off metadata, and thus adding information
about resources you do not control, or which are not XML, or
which are read-only.
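In code, stand-off metadata amounts to nothing more than merging sets of statements from different sources. The following Python sketch assumes statements are plain tuples; the `#reviewedBy' property and the `#someReader' resource are hypothetical names, made up for the illustration.

```python
# Statements made by whoever controls the resource.
published = {
    ("http://x.net/rdf-intro", "#Author", "http://x.net/norman/"),
}

# Statements made by someone else, about a URI they do not control.
# The property and value here are hypothetical.
standoff = {
    ("http://x.net/rdf-intro", "#reviewedBy", "#someReader"),
}

# Merging the two needs no extra syntax: it is just set union.
merged = published | standoff


def describe(uri, statements):
    """List the (property, value) pairs asserted about uri, in sorted order."""
    return sorted((p, o) for (s, p, o) in statements if s == uri)


for prop, value in describe("http://x.net/rdf-intro", merged):
    print(prop, value)
```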
Because it's too complicated. Consider the example at the top again. It could be written in XML as:
<resource>
  <uri>#m31</uri>
  <luminosity>
    <type> #Vmag <subClassOf> #Mag </subClassOf> </type>
    <value>8</value>
  </luminosity>
</resource>
or
<resource uri="#m31">
  <luminosity type="#Vmag|#Mag"> 8 </luminosity>
</resource>
or any one of a large number of other ways. As humans, we have little difficulty in making some sense of this, but it would be a fuss at least to define the XML Schema which this is part of, and to write software to process it. The problem is that, in XML, concepts like trees, children, attributes, which are so helpful when we wish to mark up general data, simply get in the way when we wish to express a set of flexibly primitive statements. XML is self-describing only to humans; machines need standards documents and piles and piles of Java.
In his xml.com article, Tim Bray asks this same question, and the point is addressed again in
more detail in Tim Berners-Lee's RDF-vs-XML note.
The pictorial form shown above is one of the defined notations for RDF. The other well-known one is the XML serialisation. The RDF/XML version of the example above is:
<rdf:RDF
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns='http://x.net/#'
    xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#' >
  <rdf:Description rdf:about='#Vmag'>
    <rdfs:subClassOf rdf:resource='#Mag'/>
  </rdf:Description>
  <rdf:Description rdf:about='#anon0'>
    <rdf:type rdf:resource='#Vmag'/>
    <numericValue>8</numericValue>
  </rdf:Description>
  <rdf:Description rdf:about='#m31'>
    <luminosity rdf:resource='#anon0'/>
  </rdf:Description>
</rdf:RDF>
...which looks horrible. If you want to unpack that, you can do so using one of the RDF primers I point to below. I introduce it here only to make the observation that the RDF/XML syntax is not intended to be human-readable, or even particularly human-writable. As I understand it, a primary consideration was that it be flexible enough to be embeddable within other XML in a wide variety of ways.
A slightly more readable example is this one:
<rdf:RDF
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns="http://x.net/rdf/#">
  <rdf:Description rdf:about='http://x.net/rdf-intro'>
    <Author rdf:resource='http://x.net/norman/'/>
  </rdf:Description>
  <rdf:Description rdf:about='http://x.net/norman/'>
    <Name>Norman Gray</Name>
    <Email rdf:resource='mailto:firstname.lastname@example.org' />
  </rdf:Description>
</rdf:RDF>
The graphical representation of it is here. That
indicates that the resource
http://x.net/rdf-intro
has for its
#Author property the resource
http://x.net/norman/, which has the two further
resources shown. More clearly than in the previous example,
each of the resources is a URI.
Even though this is a little simpler than the previous example, it is still rather difficult to read. Clearest is to break this into the three subject-predicate-object triples which the RDF fundamentally represents. They are:
<http://x.net/rdf-intro> <#Author> <http://x.net/norman/> .
<http://x.net/norman/> <#Name> "Norman Gray" .
<http://x.net/norman/> <#Email> <mailto:email@example.com> .
The triples form of the first example is:
<#m31> <#luminosity> <#anon0> .
<#anon0> <#numericValue> "8" .
<#anon0> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <#Vmag> .
<#Vmag> <http://www.w3.org/2000/01/rdf-schema#subClassOf> <#Mag> .
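To show that the step from RDF/XML to triples is mechanical, here is a very limited Python sketch using the standard xml.etree.ElementTree module. It is emphatically not an RDF/XML parser: it assumes the document uses only rdf:Description elements with rdf:about, whose children either carry rdf:resource or contain literal text, which happens to be all that the first example needs. The function name `extract_triples` is my own. Note also that it reports each property as namespace plus local name (so http://x.net/#luminosity), and leaves relative URIs such as #m31 unresolved.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The RDF/XML version of the first example.
RDF_XML = """<rdf:RDF
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns='http://x.net/#'
    xmlns:rdfs='http://www.w3.org/2000/01/rdf-schema#' >
  <rdf:Description rdf:about='#Vmag'>
    <rdfs:subClassOf rdf:resource='#Mag'/>
  </rdf:Description>
  <rdf:Description rdf:about='#anon0'>
    <rdf:type rdf:resource='#Vmag'/>
    <numericValue>8</numericValue>
  </rdf:Description>
  <rdf:Description rdf:about='#m31'>
    <luminosity rdf:resource='#anon0'/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(xml_text):
    """Pull (subject, property, value) tuples out of simple rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree spells a qualified name as "{namespace}local".
            namespace, _, local = prop.tag[1:].partition("}")
            resource = prop.get(f"{{{RDF_NS}}}resource")
            # The value is a resource if rdf:resource is present, else literal text.
            value = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, namespace + local, value))
    return triples


for triple in extract_triples(RDF_XML):
    print(triple)
```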
There is another syntax, called Notation3 or n3, which is similar to this triples notation, and which is designed to be `scribblable', in the sense of being writable on a whiteboard. There are tools available to convert back and forth between these various syntaxes. There are other syntax proposals, but the main point of RDF is not syntax, and we shouldn't become bogged down in it.
http://www.w3.org/RDF/ is the home of all things RDF. This is still in flux, and I understand that the later documents here largely supersede the 1999 spec, even though they're not yet (June 2003) at recommendation.
`What is RDF?', which, most unusually for these introductions, does actually explain what the point is.
The IVOA semantics list has had a number of discussions on RDF and ontology and the like. Notable (to me) messages and threads include
In an xml.com article, Edd Dumbill describes some of the possibilities of RDF, using R V Guha's rdfDB.