The VO and the SW

This is a multi-conference report, describing the ways in which the VO is and is not following what other specialities seem to regard as best practice. There seems to be rather more of the former than the latter (which is reassuring).

I was at a rather varied collection of conferences this year, at each of which, however, I found rather more that was relevant to the VO than I expected. Either by chance or design, we seem to be doing quite a lot of what other specialities regard as good practice. The following describes these interactions, partly to encourage ourselves, but also to examine the differences, and see whether we're missing a trick.

The four meetings (apart from the IVOA Kyoto meeting, of course) were the Manchester Protégé tutorial, run by the Manchester CS folk to spread the good word and good practice; a DCC workshop on persistent identifiers, which I discovered to be a lot more contentious than you might think; the European Computing and Philosophy conference, where I discovered that practicality is not one of philosophers' key skills; and the ISWC, where I discovered that we're already behind the breaking wave of Semantic Web Cooool (and that this is a good thing).

1 The Manchester Protégé Tutorial

Or... how to make pizza.

The Manchester Computer Science department runs a tutorial on Protégé, which focuses on developing an ontology for pizzas, their bases and their toppings, and which culminates in the challenge of crafting a definition for VegetarianPizza (somewhat more complicated than you'd imagine). The tutorial, which has been running in one version or another for some years, is concerned with ontologies in general, and the Protégé ontology editor in particular.
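The subtlety can be sketched without OWL at all. Here is a toy Python rendering (all the topping names are invented) of why VegetarianPizza is a defined class rather than an asserted one: membership follows from a pizza's toppings, and the definition turns on a universal quantification over them.

```python
# A toy sketch (not OWL) of the VegetarianPizza exercise: a *defined*
# class, whose membership is computed from a pizza's toppings rather
# than asserted directly. All names here are illustrative.

MEAT_TOPPINGS = {"ham", "pepperoni", "anchovy"}

def is_vegetarian(toppings):
    """A pizza is vegetarian iff it has no topping that is a meat topping.
    Note the universal quantification: we must check *every* topping,
    which is the subtlety the tutorial's exercise turns on."""
    return all(t not in MEAT_TOPPINGS for t in toppings)

print(is_vegetarian({"tomato", "mozzarella"}))   # True
print(is_vegetarian({"tomato", "pepperoni"}))    # False
```

In OWL the same thing is expressed as an equivalent-class axiom with a negated existential restriction, and it is the reasoner, not the author, who works out which pizzas fall under it.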

1.1 Best practices, and other decisions to be made

There's a collection of materials from the Semantic Web Best Practices group, containing cookbooks and various still-baking proposals. There are also various resources on the Co-ode web pages.

1.1.1 Pre-coordinated ontologies

There's more than one way to deliver your ontology.

The most obvious way (from at least one point of view) is to ship your intricately described ontology to applications which are expected to do clever things with it. That is, you expect that the application will make use of a reasoner, either internally or as an external service, and do its inferencing on the fly -- deducing as lazily as possible whether a given instance is or is not a member of a given class, perhaps.

The other possibility is for the ontology to be shipped pre-coordinated, that is, with all the possible deductions about the subsumption hierarchy worked out in advance. You can view this as a sort of `compilation' of the ontology. This gives you the freedom to maintain an `inverse tardis', with a small (and thus maintainable) and potentially intricate core, expanding into a much larger but simpler deliverable.

The main advantage of this is that it does not require the application which uses the ontology to have any (access to) reasoning capabilities, beyond whatever is implied by its simple ability to read in the ontology in its pre-coordinated format. The cases in which this is a stand-out win are some medical ontologies which have class names like {Upper,Lower}{Tibia,Fibula}{Break,Fracture} which are declared logically, but which expand into a combinatorially large number of simple classes.
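As a sketch of what such a `compilation' step looks like, here is a toy expansion of a compact logical core into its pre-coordinated class list (the bone and injury fragments loosely follow the medical example; everything else is invented):

```python
# A minimal sketch of 'pre-coordination': a small logical core is
# expanded ('compiled') in advance into the full flat list of classes,
# so a consuming application needs no reasoner, only the ability to
# read the expanded file.
from itertools import product

core = [("Upper", "Lower"), ("Tibia", "Fibula"), ("Break", "Fracture")]

# Expand the compact core into every concrete class name.
precoordinated = ["".join(parts) for parts in product(*core)]

print(len(precoordinated))   # 8 classes from a 6-term core
print(precoordinated[0])     # UpperTibiaBreak
```

The `inverse tardis' effect is visible even at this scale: three two-way choices in the core become eight deliverable classes, and real medical vocabularies multiply many more axes together.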

To the extent that it's an ontology, the UCD hierarchy is a pretty simple one: the difference between a pre- and a post-coordinated version would be rather minor, and rather few applications would benefit from the pre-coordinated one. So this isn't a big issue for UCDs, but it's probably a useful distinction, and a reassurance that there are principled ways to avoid applications having to embed reasoners.

1.1.2 Normalised ontologies

Alan Rector talked about best practices in creating ontologies, in particular about `normalised' ontologies. The idea is similar to the notion of normalisation of databases, and has the same goal of modularisation for ease of maintenance. It's described at some length in [rector03], referring to ideas in the earlier [guarino00]. That paper contains useful and concrete advice, couched in a helpful vocabulary of `primitive' and `derived' concepts. It suggests that the modules into which you divide your ontology should include a skeleton of strictly hierarchical and disjoint primitive classes, where the subsumption relation is IsA (as opposed to PartOf or something similar). This means that in this skeleton tree there is only single inheritance, and any multiple inheritance appears only after reasoning, as a result of restrictions on the primitive concepts.
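As an illustration (mine, not from [rector03] itself), the single-inheritance constraint on the primitive skeleton is easy to check mechanically; the class names below are invented astronomical examples:

```python
# Hedged sketch of the 'normalised ontology' discipline: the primitive
# skeleton must be a strict IsA tree (single inheritance), so each
# primitive class has at most one declared parent; multiple inheritance
# should emerge only from reasoning, never be asserted directly.

def skeleton_is_normalised(isa_edges):
    """isa_edges: list of (child, parent) primitive IsA assertions.
    Returns True iff no class has more than one declared parent."""
    parents = {}
    for child, parent in isa_edges:
        if child in parents and parents[child] != parent:
            return False   # asserted multiple inheritance: not normalised
        parents[child] = parent
    return True

ok  = [("Spiral", "Galaxy"), ("Elliptical", "Galaxy"), ("Galaxy", "Object")]
bad = ok + [("Spiral", "Barred")]   # Spiral now has two asserted parents
print(skeleton_is_normalised(ok), skeleton_is_normalised(bad))  # True False
```

In the normalised style, the fact that a spiral might also be barred would be expressed as a restriction, and the reasoner would infer the second parent rather than having it asserted in the skeleton.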

1.2 Ontologies in bioinformatics

Katy Wolstencroft described a system for extracting and classifying proteins from newly sequenced genomes. The problem here is that genome sequencing is now very fast -- taking months rather than years -- so data is now being produced faster than it can be analysed. Sequencing produces DNA sequences, which encode protein molecules, which have complex structures made up of thousands of instances of a basic set of 20 or so structurally and chemically distinct amino acids. There are patterns in these sequences, called `domains', and the analysis consists of finding these domains within the sequences.

Humans do this domain-spotting, assisted by tools, but using their own knowledge. What Katy did, for a particular family of proteins, was formally describe the rules for membership of a particular class -- on what grounds is a particular bit of sequence identified as being in a particular domain? Given an initial hierarchy plus a set of membership rules, the reasoner's job is therefore to generate the larger inferred hierarchy, and then for each individual sequence to work out which classes/domains that sequence is a member of.
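The shape of this, stripped to a caricature, is a set of formal membership rules plus machinery that applies them all. The motifs and class names below are entirely invented, and real domain rules are of course far richer:

```python
# An illustrative (invented) miniature of the domain-classification
# idea: class membership is given by a formal rule over the sequence,
# and the 'reasoner' is just the machinery that applies every rule to
# every sequence. These motifs are made up for the sketch.

DOMAIN_RULES = {
    "ZincFinger": lambda seq: "CXXC" in seq,       # contains a motif
    "Kinase":     lambda seq: seq.startswith("MGS"),  # starts with a motif
}

def classify(seq):
    """Return the set of domain classes whose membership rule
    this sequence satisfies."""
    return {name for name, rule in DOMAIN_RULES.items() if rule(seq)}

print(classify("MGSACXXCTT"))   # both rules fire
```

The point of doing this with a description-logic reasoner rather than hand-written lambdas is that the rules then live in the ontology, where domain experts can inspect, extend and correct them.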

The ontology was able to produce a classification which matched that produced by expert human annotators (ie, it was working), and was further able to identify domains that they had missed (ie, it was generating science).

The other useful thing about this work (which was described in more detail at ISWC2005 [wolstencroft05]) is that `a little ontology goes a long way', in the sense that specifying the class relations required only rather unexciting (to a computer scientist) logical relations. Add to this the claimed logical deficiencies of the GO (see the discussion of coherent ontologies below), and it's plain that reasoning tools are a lot more useable in the real world than they might at first appear.

There are various problems in astronomy which are analogous to this, such as the problem of classifying objects in a catalogue. Of course programs already exist which can do this, using hand-implemented algorithms, but this approach holds the promise of having a separate system do the heavy lifting, leaving the logic to be specified at a usefully high level.

1.3 Why the Gene Ontology succeeded

The Gene Ontology is the Semantic Web community's current poster child. It's a large and complicated, but still practical, ontology, which has been worked on by many Computer Scientists, but controlled by Biologists, and is successful in the very practical sense that it is routinely producing Science (as in biology, not CompSci). Carole Goble was talking about the various reasons why the GO has been as successful as it has.

There's a fuller account in [bada04].

I think there are a few things we can take from that in the VO. The first is that leadership matters: the GO didn't arise by itself, and didn't arise because biologists were baying for ontologies, but came about because a small group (or was it one individual?) pushed and pushed until the rest of the community saw what the benefit was and joined in. The IVOA is rather more faceless than that, but should probably find this reassuring.

Secondly, the GO is less rigorous than computer scientists would wish, and the world has not ended as a result. The IVOA is in some respects currently preferring practicality over elegance, and it's reassuring that this general approach worked for the biologists.

1.4 Ontology change

The other, really important, thing is that the ontology is in constant flux. It seems that there are tens of changes to the ontology per day, and these are either live or (I'm not sure) released on a sufficiently short timescale that they might as well be (the main GO ontology download is updated on the GO web pages every 30 minutes). The ontology has grown from 1000 concepts four years ago, to 19000 now. The ontology changes both because errors and deficiencies are fixed, and because the subject itself changes.

This rate of change is much larger than we expect to see in any anticipated astronomical ontology, but the headline point here, made about this ontology and it seems every production ontology I've heard about over the last few months, is that ontologies change, and whether this happens on a timescale of hours, weeks or years, the systems which make use of them simply must be able to handle this. Although I heard this said of many ontologies, nowhere have I heard the complaint that this is deeply problematic: as long as you can make the changes in a suitable language -- stating formally that concept X is now a subclass of concept Y for example -- then the reasoner doesn't care. And because the tools handle the reasoning internally, the users don't care either.

This, therefore, turns into an aspect of the more general problem of aligning ontologies (my favourite hobbyhorse), with the almost-isomorphic ontologies here being two instances of the one ontology separated by time, rather than two datacentres' similar offerings.

There are multiple general observations about the tensions involved when computer scientists talk to natural scientists, in Carole's excellent [goble05], which is a member of that small class of papers with an abstract in blank verse.

2 DCC persistent identifiers workshop

The Digital Curation Centre is a multi-centre UK initiative to develop expertise in the curation of digital data from a range of sources from the humanities to the sciences. A lot of their personnel are from the library community, but with others from computing science and space research (David Giaretta is one of its associate directors).

The DCC organised a workshop on Persistent Identifiers in June/July. I went along to it mostly, if truth be told, because it was taking place in the next-door building in Glasgow, and I fancied a change of wallpaper for a couple of days. After all, with URIs and URNs in the world, persistent identifiers are all sorted out, aren't they. Aren't they?

Oh, no they're not...

URNs are either URIs with `an institutional commitment to persistence' (this phrase has been knocking around for a while, without ever quite appearing in a standard; it may have first appeared in a www-tag message from Paul Prescod), or else specific URIs with the syntax described in [std:rfc2141]; but this in fact does nothing more than create a bit of terminology and a bit of syntax, and skips the hard questions, such as the ones below.

This workshop was useful, therefore, not for the answers it offered, but for the rather obvious-in-retrospect questions it asked.

2.1 What does an identifier identify?

This question is a fairly reliable way to start a fight, in the right company. Does an identifier identify a particular set of bytes, or might it represent only the current version of a resource, so that the contents returned when it is dereferenced (if indeed it ever is) could potentially change each time? Does a homepage URL refer to a person or to a web page? Are identifiers still useful (and not just in theory) if they are not dereferencable? Should identifiers have a `meaning', and if so, who gets to specify it? Is there an important practical difference between a locator (a URL) and an identifier (a URI)? Oh yes, and what about persistence?

While it's possible to have quite enjoyably sophisticated arguments about all of these questions, and to reasonably take all of the possible implied positions, and probably several of them simultaneously in different contexts, there is no consensus on any of them, and the only stable resolution is to avoid giving a technical answer to a fundamentally social problem. This position is blessed also in the recent Web Architecture document [std:webarch] (or rather, in the rdfURIMeaning-39 TAG `issue', related to the WebArch document), which seems to conclude that `meaning is use', and tries to leave it at that. Thus only the creator or owner of a binding (of identifier to resource) can identify the boundaries and invariants of the resource in question (Henry Thompson), and so if homepage URLs end up being used to identify people, then that is therefore what they mean.

2.2 IVOA registry IDs

The IVOA of course has its own persistent identifiers, the IVOA Identifiers [std:ivoaid]. It seems useful to compare that document with some of the best practices and problems raised in this workshop. Quite a few of these best practices are also included in the WebArch document [std:webarch]. For those who aren't familiar with this, it's a much argued over recent document describing what are believed to be the architectural principles which let the web be as successful as it has been. It's not holy writ, but I for one believe that its suggestions are such that deviation from them should generally be acknowledged as such and explained.

2.3 The new URI scheme

There does seem to be something of a consensus against creating new URI schemes. The WebArch document says so explicitly [std:webarch], and it seemed to be one of the few things about which there was fairly general agreement at this workshop. Henry Thompson (Edinburgh and W3C Technical Architecture Group) listed as one of his headline Good Things `to benefit from and increase the value of the World Wide Web, agents should provide http: URIs as identifiers for resources'. For the record, the rest were: `global naming leads to global network effects'; `it is good for the ownership of a name URI to be manifest in the form of that URI'; `a URI owner should not associate arbitrarily different URIs with the same resource'; `agents do not incur obligations by retrieving a representation' (ie, GET is side-effect free); and `a URI owner should provide representations of the resource it identifies': all of these are exemplified to a greater or lesser extent by the IVOA ID proposal.

At present, there are about 50 IANA URI schemes standardised (see the IANA list), most of which I've never heard of, and each one has gone through the full tortuous RFC process. There is a current internet draft (see under `individual submissions') called Guidelines and Registration Procedures for new URI Schemes (which is reportedly likely to be approved) which refounds the URI scheme registration mechanism of [std:rfc2717]/[std:rfc2718], and which distinguishes `permanent' from `provisional' scheme names. The distinction between these two types is that the former are ideologically pure and the latter have a simplified registration mechanism. This doesn't undermine the architecturo-theological principle that the http: scheme is the One True Way; it's simply the IANA bowing to the inevitable, and acknowledging that people are currently using new schemes, and that since they are doing so, it's better that these be registered than not.

All this anguish about new schemes was slightly undermined for me by the realisation that at least two folk at this workshop were actively involved in registering new scheme names! One group was the DOI folk, who are working on registering the doi: scheme; but in a surprisingly vigorous exchange, Henry Thompson described them as the Bad Guys, forking the web (A witch! Burn her!), so it's clear they have quite separate inquisitorial challenges in store.

2.4 A sideline: ARK persistent identifiers

The other new URI scheme is associated with John Kunze's ARK persistent identifier scheme. This describes a new URI scheme and a syntax for hierarchical identifiers. In principle these identifiers would be resolvable by an ARK resolver service; the expectation, however, is that they will be resolved instead by directly prepending the URL of a resolution service. ARK therefore appears to be a new scheme, in the sense that it conceives in principle of strings starting ark:..., but avoids perdition by pulling out at the last moment and defining a resolution mechanism which turns the identifiers back into HTTP URLs for maximum architectural goodness. This specification includes registration authorities, a comparison algorithm (essentially, strip everything before the ark:) and, interestingly, the provision that if foo is a retrievable ARK URL, then foo? is a metadata statement, and foo?? is a persistence statement.

There's more in ARK, but it's basically quite simple, and very nice. It has a clean distinction between authority IDs and a potentially hierarchical resource key; it has object metadata and repository statements readily to hand; and its resolution mechanism is why-didn't-I-think-of-that clever, RESTful, and means that the only key-resolution software you need is string concatenation and wget. It seems a very good impedance match to what the VO seems to want, to the extent that it might be worth lifting ideas from it, or possibly even adopting it wholesale, and letting someone else deal with the punctuation and security niggles involved in making an RFC bulletproof. The Internet Draft contains quite a good discussion of the problems of defining and using persistent identifiers.
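To make the why-didn't-I-think-of-that point concrete, here is a sketch of the two operations in Python; the resolver URLs and the name-assigning-authority number are invented examples, not real ARK registrations:

```python
# A sketch of the two pleasingly simple ARK operations described above.
# The resolver URLs and authority number below are invented examples.

def resolve(ark, resolver="http://example.org/"):
    """Resolution is just string concatenation: prepend a resolver URL."""
    return resolver + ark

def ark_core(url):
    """ARK comparison: strip everything before the 'ark:' substring, so
    the same object reached via different resolvers compares equal."""
    i = url.find("ark:")
    return url[i:] if i >= 0 else url

a = resolve("ark:/12345/xt6b")
b = "http://another.example.net/ark:/12345/xt6b"
print(ark_core(a) == ark_core(b))   # True: same identifier, two resolvers
```

That really is the whole of the client-side machinery: as the text says, string concatenation plus wget.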

2.5 The authority identifier: opaque or semantic

The Identifiers PR describes an IVOA Resource Identifier as something like adil.ncsa or org.astrogrid.www. This is what identifiers people refer to as a `semantic identifier', as opposed to an `opaque identifier', which is a string that does not suggest any meaning at all. Having the owner implied by the name in this way fits the requirements in Ray's list. It turns out that a preference for one over the other is another theological matter, so for the benefit of those who weren't aware what deep water they were paddling into (ie, me), it seems useful to summarise the issues here.

Semantic IDs are things like DNS entries, or most URLs, in which parts of the ID have, or at least appear to have, meaning. The advantages of this are that the IDs are memorable, might persistently encode some of the ID metadata, and might include a `brand' for the data provider or mediator (such as org.astrogrid.www).

Opaque IDs, on the other hand, have no apparent meaning, and might be composed of a hash or a random string. The advantages of opaque IDs are that the IDs are not memorable, encode no metadata, include no branding, and thus can excite no loyalty (in the section on the GO above, I mentioned that its concept IDs were opaque, thus avoiding any political or emotional attachments). The lack of branding is rather unexpectedly quite important, since it removes any temptations for successor projects to deprecate IDs when it becomes politically useful to demonstrate that they are taking over the world. It is claimed that opaque IDs are generally easier to manage in practice. DOIs are opaque IDs; ARK seems happy with either.

3 ECAP and coherent ontologies

This was the European Computers and Philosophy conference in June (I wasn't really at this conference, but my wife was on the programme committee, so I just — well — marched in).

The talk that was of most relevance to us was Barry Smith talking about bio-ontologies, and what's so terribly wrong with them.

The fundamental problem, it seems, is that the set of relations which the GO uses — namely is-a and part-of — are not rich enough to accurately describe the relations that the GO claims to be describing, with the consequence that the terms are used in multiple senses, and the resulting hierarchy is incoherent, and allows meaningless conclusions to be drawn.

There's quite a long list of entertainingly daffy examples. The relation part-of means in various places `may be part of', or `is sometimes part of' or `is included as a sublist in', so that it seems that you can conclude from the GO that the `region external to a cell' is part-of a cell, and `pollen' is-a `biology'. The underlying problem here is that, although part-of and is-a should be strictly transitive properties, they are not being used as such, and because there are only two connecting relationships, there are no alternative relationships to use to express a less restrictive meaning. I think that this is an example of a failure to produce a `normalised ontology' as described above. It appears that the GO folk were already discussing this problem informally, as a failure of some GO branches to satisfy a `true-path' property, and have indeed since added a third relationship.
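The transitivity trap is easy to demonstrate mechanically. This toy fragment (invented edges, loosely following the example above) shows how chaining differently-intended part-of assertions licenses the daffy conclusion:

```python
# A toy illustration (invented data) of the transitivity trap: if
# 'part-of' is really being used with several weaker meanings, naively
# chaining the edges licenses the daffy conclusion quoted above.

PART_OF = {
    "region external to cell": "cell environment",  # really 'adjacent-to'
    "cell environment": "cell",                     # really 'surrounds'
}

def transitive_part_of(x, y):
    """Naively chase part-of edges, treating the relation as transitive."""
    while x in PART_OF:
        x = PART_OF[x]
        if x == y:
            return True
    return False

# The reasoner happily 'proves' the exterior of a cell is part of the cell:
print(transitive_part_of("region external to cell", "cell"))   # True
```

Nothing in the machinery is wrong: the reasoning is perfectly sound given the asserted edges. The fault is in using one relation name for several relations with different logical properties, which is exactly what the richer relation vocabulary is meant to fix.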

It seems also that the GO has grown unwieldy, so that there are now effectively three separate ontologies for cell structure, and for genes' molecular and biological functions, all huddled together in the GO. These issues are discussed in greater detail in [smith04].

Smith's response to this is to assert that there is a crying need for a broader range of relationships, and in [smith05] he describes such a list which can apparently describe all that the GO or the Open Biomedical Ontology needs, such that sensible inferences can be drawn. The additions are plausible (`located-in', `contained-in', `adjacent-to', and similar), but there are ten of them, which is eight more than the GO staggered along with before. Are the documentation and maintenance costs worth it?

Question number one: why, when it's apparently as horrifically bad as this, is the GO still in existence and apparently producing science? The answer depends on whom you're asking. The philosopher finds the ontology's incoherence repellent enough just by itself, but can also correctly point out that the incoherence can produce statements which are wrong, or not usefully meaningful. The biologists don't care, however, because they tend not to use the GO in situations where such problems matter. Yes, if you say `the exterior of a cell is part of a cell' it seems silly, but biologists Know What You Mean, and if they find such a statement in a situation where it actually is causing a problem, then they simply get on to the GO maintainers and badger them to fix it. That is

  1. it appears that the GO is predominantly used for searching, and classification, and activities similar to this, where it is easy and reasonable for the human users of the reasoning results to supply subconsciously the implicit extra semantics omitted from the ontology; and
  2. GO-enhanced searching is so massively better than the database-to-database hopping that biologists had to do before, that they are more than happy to put up with a few funny results, without it even registering that there's a problem.

Or in other words, the GO's reasoning is good enough.

The lesson for the VO is that this argument also works for UCDs, to the extent that their core use-case is `finding stuff', but probably won't work for Data Models, since they're more involved with `using stuff'. That's a further argument for keeping those two modelling domains compatible but clearly distinct.

4 ISWC 2005

The final meeting I was at this year was the ISWC conference in Galway. There are notes on some of the talks at the VOTech wiki — unfortunately, you'll need a VOTech login to see them.

Though there were several interesting papers and posters, and plenty of evidence of folk doing interesting things with the Semantic Web, for me the most interesting thing about this meeting was that it was rather less thought-provoking than the meetings I've described above. The two principal current concerns of the SW community are (a) to develop more interesting or general formal logics and the reasoners to handle them, and (b) to develop some totally, like, radical cool killer app which will take over the web and have all our wearable PDAs nattering semantically to each other while we go off and live in caves. Neither of these worthy goals is to be despised, of course, but it does suggest that the sort of core functionality that we need is well enough served by an ecology of existing tools, that it's already dropping off the radar of the SW community, and turning into plain old engineering.

5 The link to the VO

So what, in the end, are the lessons for the VO?


A Glossary

alignment of ontologies
mapping from one ontology to another, presumably mostly compatible, one
ontology
a taxonomy plus formally declared relations between the terms in it, stating which classes are equivalent to, or subsumed within, which others
subsumption
subclassing — the subsumption hierarchy is the full tree of which classes are subclasses of which others
taxonomy
a controlled vocabulary, distinguished from an ontology by its lack of formally declared relations between terms

B Bibliography

[bada04] Michael Bada, Robert Stevens, Carole Goble, Yolanda Gil, Michael Ashburner, Judith A. Blake, J. Michael Cherry, Midori Harris, and Suzanna Lewis.
A short study on the success of the gene ontology. Journal of Web Semantics, 1 no. 2, 2004.
[goble05] Carole Goble and Chris Wroe.
The Montagues and the Capulets. Comparative and Functional Genomics, 5 no. 8 pp. 623-632, 2005.
[guarino00] Nicola Guarino and Christopher A. Welty.
Towards a methodology for ontology based model engineering. In Proceedings of ECOOP-2000 Workshop on Model Engineering, 2000.
[std:webarch] Ian Jacobs and Norman Walsh.
Architecture of the world wide web, volume one. W3C Standard, 2004.
[std:rfc2718] L Masinter, H Alvestrand, D Zigmond, and R Petke.
Guidelines for new URL schemes. RFC 2718, 1999.
[std:rfc2141] R Moats.
URN syntax. RFC 2141, 1997.
[std:rfc2717] R Petke and I King.
Registration procedures for URL scheme names. RFC 2717, 1999.
[std:ivoaid] Raymond Plante, Tony Linde, Roy Williams, and Keith Noddle.
IVOA identifiers. IVOA Proposed Recommendation, 2005.
[rector03] Alan L Rector.
Modularisation of domain ontologies implemented in description logics and related formalisms including OWL. In Knowledge Capture 2003, (Sanibel Island, FL, 2003), pages 121-8. ACM, 2003.
[smith05] Barry Smith, Werner Ceusters, Bert Klagges, Jacob Kohler, Anand Kumar, Jane Lomax, Chris Mungall, Fabian Neuhaus, Alan L Rector, and Cornelius Rosse.
Relations in biomedical ontologies. Genome Biology, 6 no. 5 pp. R46, 2005.
[smith04] Barry Smith and Anand Kumar.
Controlled vocabularies in bioinformatics: a case study in the gene ontology. Drug Discovery Today: BIOSILICO, 2 no. 6 pp. 246-252, 2004.
[wolstencroft05] Katy Wolstencroft, A Brass, I Horrocks, P Lord, U Sattler, D Turi, and R Stevens.
A little semantic web goes a long way in biology. In Yolanda Gil, Enrico Motta, V Richard Benjamins, and Mark A Musen, editors, The Semantic Web -- ISWC 2005, volume 3729 of LNCS, pages 786-800. Springer, November 2005.
Norman Gray
2007/07/08 21:16:20