I was at a rather varied collection of conferences this year, at each of which, however, I found rather more that was relevant to the VO than I expected. Either by chance or design, we seem to be doing quite a lot of what other specialities regard as good practice. The following describes these interactions, partly to encourage ourselves, but also to examine the differences, and see whether we're missing a trick.
The four meetings (apart from the IVOA Kyoto meeting, of course) were the Manchester Protégé tutorial, run by the Manchester CS folk to spread the good word and good practice; a DCC workshop on persistent identifiers, which I discovered to be a lot more contentious than you might think; the European Computing and Philosophy conference, where I discovered that practicality is not one of philosophers' key skills; and the ISWC, where I discovered that we're already behind the breaking wave of Semantic Web Cooool (and that this is a good thing).
Or... how to make pizza.
The Manchester Computer Science department runs a tutorial on Protégé, which focuses on developing an ontology for pizzas, their bases and their toppings, and which climaxes in the challenge of writing a definition for VegetarianPizza (somewhat more complicated than you'd imagine). The tutorial, which has been running in one version or another for some years, is concerned with ontologies in general, and the Protégé ontology editor in particular.
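For concreteness, here is a minimal sketch of that closing exercise, written with the Python owlready2 library rather than in Protégé itself; the class and property names echo the pizza tutorial, but this particular encoding is mine rather than the tutorial's official answer.

    from owlready2 import *

    # A hypothetical ontology IRI, used only for this illustration.
    onto = get_ontology("http://example.org/pizza-sketch.owl")

    with onto:
        class Pizza(Thing): pass
        class PizzaTopping(Thing): pass
        class MeatTopping(PizzaTopping): pass
        class FishTopping(PizzaTopping): pass
        class hasTopping(Pizza >> PizzaTopping): pass

        # The punchline: a vegetarian pizza is a pizza all of whose toppings
        # are neither meat nor fish.  Getting the universal ('only') rather
        # than the tempting existential form is exactly the subtlety the
        # tutorial trades on.
        class VegetarianPizza(Pizza):
            equivalent_to = [
                Pizza & hasTopping.only(Not(MeatTopping | FishTopping))
            ]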
There's a collection of materials from the Semantic Web Best Practices group, containing cookbooks and various still-baking proposals. There are also various resources on the Co-ode web pages.
There's more than one way to deliver your ontology.
The most obvious way (from at least one point of view) is to ship your intricately described ontology to applications which are expected to do clever things with it. That is, you expect that the application will make use of a reasoner, either internally or as an external service, and do its inferencing on the fly -- deducing as lazily as possible whether a given instance is or is not a member of a given class, perhaps.
The other possibility is for the ontology to be shipped pre-coordinated, that is, with all the possible deductions about the subsumption hierarchy worked out in advance. You can view this as a sort of `compilation' of the ontology. This gives you the freedom to maintain an `inverse tardis', with a small (and thus maintainable) and potentially intricate core, expanding into a much larger but simpler deliverable.
The main advantage of this is that it does not require the
application which uses the ontology to have any (access to) reasoning
capabilities, beyond whatever is implied by its simple ability to read
in the ontology in its pre-coordinated format. The cases in which
this is a stand-out win are some medical ontologies which have class
names like {Upper,Lower}{Tibia,Fibula}{Break,Fracture}
which are declared logically, but which expand into a combinatorially
large number of simple classes.
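As a hedged illustration of what such a `compilation' step might look like (using the Python owlready2 library and an invented file name, not any particular production setup), one can run a reasoner once, offline, and ship the expanded result:

    from owlready2 import *

    # Load the small, intricate core ontology (the path is illustrative).
    onto = get_ontology("file:///tmp/core-ontology.owl").load()

    # Run the bundled HermiT reasoner (requires Java), storing the inferred
    # subsumption relations in the ontology itself rather than leaving them
    # implicit in the logical definitions.
    with onto:
        sync_reasoner()

    # Save the expanded, pre-coordinated deliverable: clients now need only
    # an RDF/OWL parser to read it, not a description-logic reasoner.
    onto.save(file="core-ontology-precoordinated.owl", format="rdfxml")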
To the extent that it's an ontology at all, the UCD hierarchy is a pretty simple one: the difference between a pre- and post-coordinated version would be rather minor, and rather few applications would benefit from the pre-coordinated one. So this isn't a big issue for UCDs, but the distinction is probably a useful one, and a reassurance that there are principled ways to avoid applications having to embed reasoners.
Alan Rector talked about best practices in creating ontologies, in particular about `normalised' ontologies. The idea is similar to the notion of normalisation of databases, and has the same goal of modularisation for maintenance. It's described at some length in [rector03], referring to ideas in the earlier [guarino00]. That document contains useful and concrete advice, couched in a helpful vocabulary of `primitive' and `derived' concepts. It suggests that the modules into which you divide your ontology should include a skeleton of strictly hierarchical and disjoint primitive classes, where the subsumption relation is IsA (as opposed to PartOf or something similar). This means that in this skeleton tree there is only single inheritance, and any multiple inheritance appears only after reasoning, as a result of restrictions on the primitive concepts.
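To make the pattern concrete, here is an illustrative (and entirely invented) owlready2 sketch of the normalised style: a single-inheritance skeleton of pairwise-disjoint primitives, with multiple parentage left for the reasoner to infer from restrictions on defined concepts.

    from owlready2 import *

    onto = get_ontology("http://example.org/normalised-sketch.owl")

    with onto:
        # Primitive skeleton: a strict IsA tree with disjoint siblings.
        class CelestialObject(Thing): pass
        class Star(CelestialObject): pass
        class Galaxy(CelestialObject): pass
        AllDisjoint([Star, Galaxy])

        # A second, orthogonal tree of primitives, kept in its own module.
        class EmissionCharacter(Thing): pass
        class RadioLoud(EmissionCharacter): pass

        class hasEmissionCharacter(CelestialObject >> EmissionCharacter): pass

        # A derived concept: defined by restriction, never asserted under two
        # parents by hand, so any multiple inheritance appears only after
        # reasoning.
        class RadioGalaxy(Thing):
            equivalent_to = [Galaxy & hasEmissionCharacter.some(RadioLoud)]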
Katy Wolstencroft described a system for extracting and classifying proteins from newly sequenced genomes. The problem here is that genome sequencing is now very fast -- taking months rather than years -- so data is now being produced faster than it can be analysed. Sequencing produces DNA sequences, which encode protein molecules, which have complex structures made up of thousands of instances of a basic set of 20 or so structurally and chemically distinct amino acids. There are patterns in these sequences, called `domains', and the analysis consists of finding these domains within the sequences.
Humans do this domain-spotting, assisted by tools, but using their own knowledge. What Katy did, for a particular family of proteins, was formally describe the rules for membership of a particular class -- on what grounds is a particular bit of sequence identified as being in a particular domain? Given an initial hierarchy plus a set of membership rules, the reasoner's job is therefore to generate the larger inferred hierarchy, and then for each individual sequence to work out which classes/domains that sequence is a member of.
The ontology was able to produce a classification which matched that produced by expert human annotators (ie, it was working), and was further able to identify domains that they had missed (ie, it was generating science).
The other useful thing about this work (which was described in more detail at ISWC2005 [wolstencroft05]) is that `a little ontology goes a long way', in the sense that specifying the class relations required only rather unexciting (to a computer scientist) logical relations. Add to this the claimed logical deficiencies of the GO (see the discussion of coherent ontologies below), and it's plain that reasoning tools are a lot more useable in the real world than they might at first appear.
There are various problems in astronomy which are analogous to this, such as the problem of classifying objects in a catalogue. Of course there already exist programs which do this, using hand-implemented algorithms, but this approach holds the promise of having a separate system do the heavy lifting, leaving the logic to be specified at a usefully high level.
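A hedged sketch of what this might look like for a catalogue, again using owlready2 and entirely invented class names: the membership rule is stated declaratively, and the reasoner does the classifying.

    from owlready2 import *

    onto = get_ontology("http://example.org/catalogue-sketch.owl")

    with onto:
        class Source(Thing): pass
        class EmissionLine(Thing): pass
        class BroadLine(EmissionLine): pass
        class hasLine(Source >> EmissionLine): pass

        # The membership rule, stated at a usefully high level: a broad-line
        # source is any source with at least one broad emission line.
        class BroadLineSource(Source):
            equivalent_to = [Source & hasLine.some(BroadLine)]

        # One catalogue entry, with an asserted broad line.
        line1 = BroadLine("line1")
        src42 = Source("src42", hasLine=[line1])

    # The reasoner does the heavy lifting: src42 should be reclassified as a
    # BroadLineSource without any hand-written classification code.
    sync_reasoner()
    print(src42.is_a)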
The Gene Ontology is the Semantic Web community's current poster child. It's a large and complicated, but still practical, ontology, which has been worked on by many Computer Scientists, but controlled by Biologists, and is successful in the very practical sense that it is routinely producing Science (as in biology, not CompSci). Carole Goble was talking about the various reasons why the GO has been as successful as it has.
There's a fuller account in [bada04].
I think there are a few things we can take from that in the VO. The first is that leadership matters: the GO didn't arise by itself, and didn't arise because biologists were baying for ontologies, but came about because a small group (or was it one individual?) pushed and pushed until the rest of the community saw what the benefit was and joined in. The IVOA is rather more faceless than that, but should probably find this reassuring.
Secondly, the GO is less rigorous than computer scientists would wish, and the world has not ended as a result. The IVOA is in some respects currently preferring practicality over elegance, and it's reassuring that this general approach worked for the biologists.
The other, really important, thing is that the ontology is in constant flux. It seems that there are tens of changes to the ontology per day, and these are either live or (I'm not sure) released on a sufficiently short timescale that they might as well be (the main GO ontology download is updated on the GO web pages every 30 minutes). The ontology has grown from 1000 concepts four years ago to 19000 now. The ontology changes both because errors and deficiencies are fixed, and because the subject itself changes.
This rate of change is much larger than we expect to see in any anticipated astronomical ontology, but the headline point here, made about this ontology and, it seems, about every production ontology I've heard about over the last few months, is that ontologies change, and whether this happens on a timescale of hours, weeks or years, the systems which make use of them simply must be able to handle this. Although I heard this said of many ontologies, nowhere have I heard the complaint that it is deeply problematic: as long as you can make the changes in a suitable language -- stating formally that concept X is now a subclass of concept Y, for example -- then the reasoner doesn't care. And because the tools handle the reasoning internally, the users don't care either.
This, therefore, turns into an aspect of the more general problem of aligning ontologies (my favourite hobbyhorse), with the almost-isomorphic ontologies here being two instances of the one ontology separated by time, rather than two datacentres' similar offerings.
There are multiple general observations about the tensions involved when computer scientists talk to natural scientists, in Carole's excellent [goble05], which is a member of that small class of papers with an abstract in blank verse.
The Digital Curation Centre is a multi-centre UK initiative to develop expertise in the curation of digital data from a range of sources from the humanities to the sciences. A lot of their personnel are from the library community, but with others from computing science and space research (David Giaretta is one of its associate directors).
The DCC organised a workshop on Persistent Identifiers in June/July. I went along to it mostly, if truth be told, because it was taking place in the next-door building in Glasgow, and I fancied a change of wallpaper for a couple of days. After all, with URIs and URNs in the world, persistent identifiers are all sorted out, aren't they. Aren't they?
Oh, no they're not...
URNs are either URIs with `an institutional commitment to persistence' (this phrase has been knocking around for a while, without ever quite appearing in a standard; it may have first appeared in a www-tag message from Paul Prescod), or else specific URIs with the syntax described in [std:rfc2141]; but this in fact does nothing more than create a bit of terminology and a bit of syntax, and skips hard questions such as those below.
This workshop was useful, therefore, not for the answers it offered, but for the rather obvious-in-retrospect questions it asked.
The question of what an identifier actually identifies is a fairly reliable way to start a fight, in the right company. Does an identifier identify a particular set of bytes, or might it represent only the current version of a resource, so that the contents returned when it is dereferenced (if indeed it ever is) could potentially change each time? Does a homepage URL refer to a person or to a web page? Are identifiers still useful (and not just in theory) if they are not dereferenceable? Should identifiers have a `meaning', and if so, who gets to specify it? Is there an important practical difference between a locator (a URL) and an identifier (a URI)? Oh yes, and what about persistence?
While it's possible to have quite enjoyably sophisticated arguments about all of these questions, and to reasonably take any of the possible implied positions (probably several of them simultaneously, in different contexts), there is no consensus on any of them, and the only stable resolution is to avoid giving a technical answer to a fundamentally social problem. This position is blessed also in the recent Web Architecture document [std:webarch] (or rather, in the rdfURIMeaning-39 TAG `issue' related to the WebArch document), which seems to conclude that `meaning is use', and tries to leave it at that. Thus only the creator or owner of a binding (of identifier to resource) can identify the boundaries and invariants of the resource in question (Henry Thompson), and so if homepage URLs end up being used to identify people, then that is what they mean.
The IVOA of course has its own persistent identifiers, the IVOA Identifiers [std:ivoaid]. It seems useful to compare that document with some of the best practices and problems raised in this workshop. Quite a few of these best practices are also included in the WebArch document [std:webarch]. For those who aren't familiar with it, that is a much-argued-over recent document describing what are believed to be the architectural principles which have let the web be as successful as it has been. It's not holy writ, but I for one believe that deviations from its suggestions should generally be acknowledged and explained.
There does seem to be
something of a consensus against creating new URI schemes. The
WebArch document says so explicitly [std:webarch], and it seemed to be one of the few things
about which there was fairly general agreement at this workshop.
Henry Thompson (Edinburgh and W3C Technical Architecture Group) listed
as one of his headline Good Things `to benefit from and increase the
value of the World Wide Web, agents should provide http:
URIs as identifiers for resources'. For the record, the rest were:
`global naming leads to global network effects'; `it is good for the
ownership of a name URI to be manifest in the form of that URI'; `a
URI owner should not associate arbitrarily different URIs with the
same resource'; `agents do not incur obligations by retrieving a
representation' (ie, GET is side-effect free); and `a URI owner should
provide representations of the resource it identifies': all of these
are exemplified to a greater or lesser extent by the IVOA ID
proposal.
At present, there are about 50 IANA URI schemes standardised (see
the IANA
list), most of which I've never heard of, and each one has gone
through the full tortuous RFC process. There is a current internet draft (see
under `individual submissions') called Guidelines and Registration Procedures for new URI
Schemes (which is reportedly likely to be approved) which
refounds the URI scheme registration mechanism of
[std:rfc2717]/[std:rfc2718], and
which distinguishes `permanent' from `provisional' scheme names. The
distinction between these two types is that the former are
ideologically pure and the latter have a simplified registration
mechanism. This doesn't undermine the architecturo-theological
principle that the http:
scheme is the One True Way; it's
simply the IANA bowing to the inevitable, and acknowledging that
people are currently using new schemes, and that since they are doing
so, it's better that these be registered than not.
All this anguish about new schemes was slightly undermined for me by the realisation that at least two folk at this workshop were actively involved in registering new scheme names! One of these was the DOI folk, who are working on registering the doi: scheme; in a surprisingly vigorous exchange, Henry Thompson described them as the Bad Guys, forking the web (A witch! Burn her!), so it's clear they have quite separate inquisitorial challenges in store.
The other new URI scheme is associated with John Kunze's ARK
persistent identifier scheme. This describes a new URI scheme and
a syntax for hierarchical identifiers. In principle these identifiers
would be resolvable by an ARK resolver service; in practice, however, the
expectation is that they will be resolved by directly
prepending the URL of a resolution service (as in
http://example.org/ark-resolver/ark:/foo/bar.fits). ARK
therefore appears to be a new scheme, in the sense that it
conceives in principle of strings starting ark:...
, but
avoids perdition by pulling out at the last moment and defining a
resolution mechanism which turns the identifiers back into HTTP URLs
for maximum architectural goodness. This specification includes
registration authorities, a comparison algorithm (essentially, strip
everything before the ark:
) and interestingly, the
provision that if foo
is a retrievable ARK URL, then
foo?
is a metadata statement, and foo??
is a
persistence statement.
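Since the resolution mechanism really is just string concatenation, a hedged sketch of the whole client side fits in a few lines of Python (the resolver address and the identifier are invented examples; the `?' and `??' suffixes follow the ARK draft as described above).

    from urllib.request import urlopen

    RESOLVER = "http://example.org/ark-resolver/"   # hypothetical resolver URL

    def resolve_ark(ark, suffix=""):
        """Dereference an ARK by prepending a resolver URL.

        An empty suffix fetches the object itself, '?' its metadata
        statement, and '??' its persistence statement.
        """
        return urlopen(RESOLVER + ark + suffix).read()

    # Usage, with an invented identifier:
    # data        = resolve_ark("ark:/12345/foo/bar.fits")
    # metadata    = resolve_ark("ark:/12345/foo/bar.fits", "?")
    # persistence = resolve_ark("ark:/12345/foo/bar.fits", "??")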
There's more in ARK, but it's basically quite simple, and very
nice. It has a clean distinction between authority IDs and a
potentially hierarchical resource key; it has object metadata and
repository statements readily to hand; and its resolution mechanism is
why-didn't-I-think-of-that clever, RESTful, and means that the only
key-resolution software you need is string concatenation and
wget
. It seems a very good impedance match to what the
VO seems to want, to the extent that it might be worth lifting ideas
from it, or possibly even adopting it wholesale, and letting
someone else deal with the punctuation and security niggles involved
in making an RFC bulletproof. The Internet Draft contains quite a
good discussion of the problems of defining and using persistent
identifiers.
The Identifiers PR describes an IVOA Resource Identifier as something like adil.ncsa or org.astrogrid.www. This is what identifier people refer to as a `semantic identifier', as opposed to an `opaque identifier', which is a string that suggests no meaning at all. Having the owner implied by the name in this way fits the requirements in Ray's list. It turns out that a preference for one over the other is another theological matter, so for the benefit of those who weren't aware what deep water they were paddling into (ie, me), it seems useful to summarise the issues here.
Semantic IDs are things like DNS entries, or most URLs, in which
parts of the ID have, or at least appear to have, meaning. The
advantages of this are that the IDs are memorable, might persistently encode
some of the ID metadata, and might include a `brand' for the data
provider or mediator (such as org.astrogrid.www
).
Opaque IDs, on the other hand, have no apparent meaning, and might be composed of a hash or a random string. The advantages of opaque IDs are precisely that they are not memorable, encode no metadata, include no branding, and thus can excite no loyalty (the GO's concept IDs, for example, are opaque, which avoids any political or emotional attachments). The lack of branding is rather unexpectedly quite important, since it removes any temptation for successor projects to deprecate IDs when it becomes politically useful to demonstrate that they are taking over the world. It is claimed that opaque IDs are generally easier to manage in practice. DOIs are opaque IDs; ARK seems happy with either.
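For illustration only, here are the two styles side by side in Python; the semantic example reuses the authority string from the text (in a hypothetical ivo:-style identifier), and the opaque ones are built from a random UUID and a content hash, two common ways of guaranteeing meaninglessness.

    import hashlib
    import uuid

    # Semantic: ownership and 'brand' are visible in the string itself
    # (a hypothetical identifier, loosely modelled on the IVOA form).
    semantic_id = "ivo://org.astrogrid.www/some-resource"

    # Opaque, random: memorable to nobody, branded by nobody.
    opaque_random_id = uuid.uuid4().hex

    # Opaque, content-derived: the same bytes always yield the same ID,
    # but the ID still says nothing about who minted it or why.
    def opaque_content_id(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    print(semantic_id)
    print(opaque_random_id)
    print(opaque_content_id(b"some resource content"))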
The third of these meetings was the European Computing and Philosophy conference in June (I wasn't really at this conference, but my wife was on the programme committee, so I just -- well -- marched in).
The talk that was of most relevance to us was Barry Smith talking about bio-ontologies, and what's so terribly wrong with them.
The fundamental problem, it seems, is that the set of relations which the GO uses -- namely is-a and part-of -- is not rich enough to accurately describe the relations that the GO claims to be describing, with the consequence that the terms are used in multiple senses, and the resulting hierarchy is incoherent and allows meaningless conclusions to be drawn.
There's quite a long list of entertainingly daffy examples. The
relation part-of
means in various places `may be part
of', or `is sometimes part of' or `is included as a sublist in', so
that it seems that you can conclude from the GO that the `region external
to a cell' is part-of
a cell, and `pollen'
is-a
`biology'. The underlying problem here is
that, although part-of
and is-a
should be
strictly transitive properties, they are not being used as such, and
because there are only two connecting relationships, there are no
alternative relationships to use to express a less restrictive
meaning. I think that this is an example of a failure to produce a
`normalised ontology' as described above. It appears that the GO folk were already discussing this
problem informally, as a failure of some GO branches to satisfy a
`true-path' property, and have indeed since added a third relationship.
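An illustrative sketch (not the GO's actual encoding, and with invented instance names) of why this matters: if part-of is declared strictly transitive but asserted in the looser senses above, a reasoner will cheerfully draw the daffy conclusions for you.

    from owlready2 import *

    onto = get_ontology("http://example.org/go-transitivity-sketch.owl")

    with onto:
        class Component(Thing): pass
        class part_of(Component >> Component, TransitiveProperty): pass

        cell          = Component("cell")
        extracellular = Component("extracellular_region")
        pollen_coat   = Component("pollen_coat")

        # The over-loose assertion: the region external to the cell recorded
        # as part_of the cell, when 'is sometimes associated with' was meant.
        extracellular.part_of = [cell]
        pollen_coat.part_of = [extracellular]

    # Because part_of is transitive, the reasoner should also infer that the
    # pollen coat is part_of the cell -- formally valid, biologically daffy.
    sync_reasoner(infer_property_values=True)
    print(pollen_coat.part_of)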
It seems also that the GO has grown unwieldy, so that there are now effectively three separate ontologies for cell structure, and for genes' molecular and biological functions, all huddled together in the GO. These issues are discussed in greater detail in [smith04].
Smith's response to this is to assert that there is a crying need for a broader range of relationships, and in [smith05] he describes such a list, which can apparently express all that the GO or the Open Biomedical Ontologies need, such that sensible inferences can be drawn. The additions are plausible (`located-in', `contained-in', `adjacent-to', and similar), but there are ten of them, which is eight more than the GO staggered along with before. Are the documentation and maintenance costs worth it?
Question number one: why, when it's apparently as horrifically bad as this, is the GO still in existence and apparently producing science? The answer to that depends on whom you're asking. The philosopher finds the ontology's incoherence repellent enough just by itself, but can also correctly point out that the incoherence can produce statements which are wrong, or not usefully meaningful. The biologists don't care, however, because they tend not to use the GO in situations where such problems matter. Yes, if you say `the exterior of a cell is part of a cell' it seems silly, but biologists Know What You Mean, and if they find such a statement in a situation where it actually is causing a problem, then they simply get on to the GO maintainers and badger them to fix it.
In other words, the GO's reasoning is good enough.
The lesson for the VO is that this argument also works for UCDs, to the extent that their core use-case is `finding stuff', but probably won't work for Data Models, since they're more involved with `using stuff'. That's a further argument for keeping those two modelling domains compatible but clearly distinct.
The final meeting I was at this year was the ISWC conference in Galway. There are notes on some of the talks at the VOTech wiki -- unfortunately, you'll need a VOTech login to see them.
Though there were several interesting papers and posters, and plenty of evidence of folk doing interesting things with the Semantic Web, for me the most interesting thing about this meeting was that it was rather less thought-provoking than the meetings I've described above. The two principal current concerns of the SW community are (a) to develop more interesting or general formal logics and the reasoners to handle them, and (b) to develop some totally, like, radical cool killer app which will take over the web and have all our wearable PDAs nattering semantically to each other while we go off and live in caves. Neither of these worthy goals is to be despised, of course, but it does suggest that the sort of core functionality we need is well enough served by the ecology of existing tools that it's already dropping off the radar of the SW community and turning into plain old engineering.
So what, in the end, are the lessons for the VO?
One of them, in the light of the identifier discussions above, is that we should at least be prepared to justify the ivo: URI scheme against the general presumption in favour of plain http: URIs.