Intelligent Access to Foreign Data

The obvious, and current, solution to the problem of exchanging data structures is to convert data into and out of a consensus model, but this approach is now labouring under the burdens of complication and of often unhappy compromises. I wish to explore a complementary approach, which gives applications the ability to directly utilise foreign data models (or ontologies) using Semantic Web and AI technologies.

This uses relatively mainstream ideas from the Information Retrieval community, and adapts existing systems.

$Revision: 1.8 $

This document first appeared as the paper counterpart of a presentation at the VOTech Kickoff meeting in Cambridge UK, November 2004.

Current modelling efforts within the VO have the goal of creating a consensus data model. Applications can translate into and out of this consensus model, allowing them to share data using O(N) rather than O(N^2) translators. I wish to describe some problems with this approach, and suggest a complementary one, which builds on it to give a fully flexible solution.

Although this contribution is within the `Resource Discovery' strand of this workshop, it is most directly concerned with how to handle data after it has been discovered. However, it has some discovery aspects (in particular, being a component of a system which could answer the question `do you have data with a given set of information?'), and the techniques involved would have applications elsewhere in the VO.

1 Introduction

Whenever an application reads or writes data, it does so by translating the data into or out of an internal data model -- the data structures used to implement the program. An application has built-in knowledge of a data format when it has code which, presented with input in that format, can fill those data structures appropriately. Usually this internal data model is implicit, but in some cases -- as with Starlink's TOPCAT application -- an application might expose its data model through a plugin architecture or other API. To read foreign data, therefore, an application needs a translator of some sort.

However, each data source also has a data model of its own. For the sake of interoperability, and to avoid every application having to provide translators for every data source, the IVOA has led the development of consensus models (UCD and VOTable so far): foreign data is translated into the consensus model by the data provider and out of it by the application, so that a single translator gives an application access to multiple sources. This is a powerful approach, but not without problems: the consensus does first have to be developed, and even when it is, the compromises involved will hit different actors in unpredictable ways.

2 One approach: The consensus data model

The consensus has been largely achieved for both VOTable and UCDs, but in both cases there was a fairly well-established draft at the beginning of a process that took longer than many expected, producing something that was less than some participants hoped. The same is largely true of the Space-Time Coordinates (STC) modelling effort, where there is a well-advanced and largely uncontested model. The IVOA data models group, starting without the first advantage, has taken longer than anyone feared, and has found it difficult to find a set of compromises that the participants do not find too painful. It may prove to be impossible to find a satisfactory solution in a single final model.

Despite these problems, I wish to emphasise that this is still a strong approach, and I am not suggesting that it be abandoned or restarted. The VOTable and UCD standards allow the VO to make progress now, and their role will continue to be important even if they don't satisfy their most optimistic goals.

It seems likely that the IVOA data modelling effort will converge on a product which is of real use in a good range of situations, but which will be either too complicated or too simple for a significant number of other uses. For example, many general-purpose applications (including most Starlink applications and their NDF/NDX data model) have a rather simple data model, and are concerned primarily with locating image, variance and quality information, along with some history and WCS information. On the other hand, JAC's ORAC-DR pipeline uses metadata to drive processing and thus needs an internal model richer than anything the IVOA consensus model could reasonably be. The same is also true of large archives' internal models. The bigger the consensus model's range of applicability, the longer and more painful will be the process of developing it, and the larger and more complicated will be the end product's documentation, making it less likely that those developers who do use it will do so without errors.

3 Plan B: describing foreign models

I wish to propose an alternative approach, in which we allow multiple data models to have equal status a priori, and avoid this leading to chaos by making it easy (for some value of `easy') for applications to `understand' foreign models. Here `multiple' is not necessarily `many', but it is more than we could expect a single application to implement natively, and more than we would want to standardise in a fully formal way. This does not preclude a subset -- including the current consensus models -- from being fully formally standardised. These models might range from simple to complicated, and widely to rarely implemented. Crucially, these models would include those that are particularly suitable for some wavelength range (radio or X-ray observations, perhaps) or application domain (general-purpose applications or a large-scale archive, perhaps).

We can communicate `understanding' via formal versions of statements like `column three of this table contains what UCD calls the image centre right ascension', or `this model's telescope pointing may be used in place of that model's image centre'.

That is, we make links between data models, rather than defining a data model absolutely, in isolation. If our application has a built-in understanding of one of the data models thus linked (which is the only `absolute' definition that matters), then the result of this link is that we now additionally have an effective understanding of the previously unknown model.

The links are not (and typically would not be) limited to single hops: if my model and a foreign model are both linked to a third one, then I can infer an understanding of the foreign model as a result. This intermediate might naturally be an IVOA consensus model, or a very well-known model in some area (such as the Chandra X-ray data model, or a specifically radio observations model). Also, my model need not be linked via just one intermediate, since I might explain most of my model using an IVOA consensus model, with a few details added from a more domain-specific one.
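
As a purely illustrative sketch of what such links might look like in practice, the following fragment (Python with the rdflib library; every namespace and term name is invented for the example) declares that a term in my model and a term in a foreign radio-observation model are each equivalent to the same term in a shared intermediate model, and then uses a SPARQL property-path query to recover everything reachable from my term, including the foreign one, two hops away.

    # A minimal sketch, assuming the rdflib library; all namespaces and
    # term names below are invented for illustration.
    from rdflib import Graph

    assertions = """
    @prefix owl:   <http://www.w3.org/2002/07/owl#> .
    @prefix my:    <http://example.org/myapp#> .
    @prefix radio: <http://example.org/radio-model#> .
    @prefix ivoa:  <http://example.org/consensus#> .

    # My application explains its term using the intermediate model ...
    my:centreRA    owl:equivalentProperty  ivoa:imageCentreRA .
    # ... and a foreign provider independently explains its term the same way.
    radio:fieldRA  owl:equivalentProperty  ivoa:imageCentreRA .
    """

    g = Graph()
    g.parse(data=assertions, format="turtle")

    # Follow equivalence links in either direction, any number of hops: the
    # results include radio:fieldRA, reached via the shared intermediate.
    q = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX my:  <http://example.org/myapp#>
    SELECT ?term WHERE {
      my:centreRA (owl:equivalentProperty|^owl:equivalentProperty)* ?term .
    }
    """
    for row in g.query(q):
        print(row.term)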

There are several technologies available for declaring this sort of relationship. The one most relevant to our present circumstances is probably OWL, standardised by the W3C, and aimed at the Semantic Web. OWL is designed for making these ontologies and declaring the relationships of their properties to each other and to those in other ontologies. As well as this standard, there are systems in existence which will do the inferencing required to answer questions like

`I know about HDX and I know about UCD: I've been given this database metadata and I want something like a UCD image centre RA: which database column should I retrieve?'.

The answer, given the above assertions, would be `column three of the table'. This response, of course, prompts a number of questions. It presumes that the application knows how to make a database query and retrieve the column in question, and that the data in the column will be of the correct type, and so on; while these are not trivial, they are questions of syntax and implementation, separate from the difficult semantic question of how we attach meaning to a piece of data.
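
To make the shape of that exchange concrete, here is a minimal sketch of the kind of assertion set and query that could lie behind it (again Python and rdflib; the namespaces, the `represents' relation and the column identifiers are all invented, and the UCD namespace is only a stand-in): the provider documents what each column of its table represents in its own model, separately asserts how one of those terms relates to the UCD image-centre RA, and the application asks which column is linked to that UCD term.

    # A sketch only: the namespaces, the 'represents' relation and the column
    # identifiers are all hypothetical, not part of any published standard.
    from rdflib import Graph

    metadata = """
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix arch: <http://example.org/archive-model#> .
    @prefix ucd:  <http://example.org/ucd#> .

    # The provider's own description of its table ...
    arch:col1  arch:represents  arch:obsDate .
    arch:col2  arch:represents  arch:exposureTime .
    arch:col3  arch:represents  arch:pointingRA .

    # ... plus the link from its model to a (stand-in) UCD term.
    arch:pointingRA  owl:equivalentProperty  ucd:imageCentreRA .
    """

    g = Graph()
    g.parse(data=metadata, format="turtle")

    # `Which column holds something usable as a UCD image-centre RA?'
    q = """
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    PREFIX arch: <http://example.org/archive-model#>
    PREFIX ucd:  <http://example.org/ucd#>
    SELECT ?col WHERE {
      ?col arch:represents ?term .
      ?term (owl:equivalentProperty|^owl:equivalentProperty)* ucd:imageCentreRA .
    }
    """
    for row in g.query(q):
        print(row.col)        # -> http://example.org/archive-model#col3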

More importantly, this example has elided the issue of how the exchange would be different if the questioning application did in fact care about the distinction between the image centre and the telescope pointing. That can be handled in a variety of ways: the question could be phrased as finding information identical to an image centre, as distinct from the looser equivalence of being merely usable in place of it; reasoners are able to give explanations of their decisions, allowing an application to reject a response post-hoc; most interestingly, it is possible to adjust or augment the set of assertions within a knowledge base, so that it could give quite different answers if it were queried in the context of astrometry work (where the difference between image centre and pointing matters) or of delivering pretty pictures or coverage information (where it doesn't).
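
One way of realising the first of those options, sketched below with the same invented namespaces, is to distinguish a strict `identical to' relation from a looser `usable in place of' relation (both names hypothetical), and let the querying application, or a context-specific assertion file, decide which it is prepared to follow.

    # Hypothetical strict and loose mapping relations: an astrometry tool would
    # query with the strict one, an image-browsing tool with the loose one.
    from rdflib import Graph

    assertions = """
    @prefix arch: <http://example.org/archive-model#> .
    @prefix ucd:  <http://example.org/ucd#> .
    @prefix map:  <http://example.org/mapping#> .

    # The pointing is *usable as* the image centre, but not identical to it.
    arch:pointingRA  map:usableInPlaceOf  ucd:imageCentreRA .
    """

    g = Graph()
    g.parse(data=assertions, format="turtle")

    strict = """
    PREFIX arch: <http://example.org/archive-model#>
    PREFIX ucd:  <http://example.org/ucd#>
    PREFIX map:  <http://example.org/mapping#>
    ASK { arch:pointingRA map:identicalTo ucd:imageCentreRA }
    """
    loose = """
    PREFIX arch: <http://example.org/archive-model#>
    PREFIX ucd:  <http://example.org/ucd#>
    PREFIX map:  <http://example.org/mapping#>
    ASK { arch:pointingRA map:usableInPlaceOf ucd:imageCentreRA }
    """

    print(g.query(strict).askAnswer)   # False: not good enough for astrometry
    print(g.query(loose).askAnswer)    # True:  good enough for a quick-look image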

Given a set of metadata, where would an application find an appropriate set of assertions -- the declarations of the links between the metadata's model and one or more known ones? These might be referred to by the metadata, as a way of documenting its semantics. They might be published by a data centre, either at a well-known URL or (more scalably) in a registry. They might be local to an application, if the application's authors know that for their purposes there are certain assertions they can make about a known set of foreign data models. Or they might be a third-party mapping of the metadata's declared model to a variety of other more or less well-known models. Though potentially intricate, these are implementation complications, rather than conceptual ones.
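
Whichever of these routes is used, the assertion sets are simply RDF, and combining them amounts to merging graphs, as the following sketch illustrates (the data centre URL and the local file name are, of course, invented).

    # Merging assertion sets from several (hypothetical) sources: the result is
    # one graph over which the application can then reason.
    from rdflib import Graph

    g = Graph()

    # Assertions referred to by the dataset's metadata, or published by the
    # data centre at a well-known (here invented) URL or found via a registry.
    g.parse("http://archive.example.org/mappings/arch-to-ucd.ttl", format="turtle")

    # Application- or site-local additions and adjustments.
    g.parse("local-overrides.ttl", format="turtle")

    print(len(g), "assertions available to the reasoner")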

The final question, then, concerns the location of the inferencing service. It could be a centrally located service (a full web service, perhaps), or be local to the application which wishes to consume the data, either as a local service or as a compiled-in library. The first case is potentially easier for an application to use, but the second case would give more scope for local customisation -- that is, the addition of any user-, site- or application-specific adjustments -- of the set of assertions which pertain to a particular foreign model. In either case, once an application has consulted the inferencer, it can query the data source directly, whether it is a local FITS file or a remote database, using that source's own data model.
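
The choice need not be visible to most of the application: a thin query interface can be backed either by an in-process knowledge base or by a remote service speaking the standard SPARQL protocol, as in the sketch below (Python standard library plus rdflib for the local case; the endpoint URL is invented).

    # Two interchangeable back ends for the same question: a compiled-in
    # (in-process) knowledge base, or a central service queried over HTTP.
    import json
    import urllib.parse
    import urllib.request

    from rdflib import Graph

    def ask_local(graph: Graph, sparql: str):
        """Run the query against an in-process rdflib graph."""
        return list(graph.query(sparql))

    def ask_remote(endpoint: str, sparql: str):
        """Run the query against a remote SPARQL endpoint (standard protocol)."""
        url = endpoint + "?" + urllib.parse.urlencode({"query": sparql})
        req = urllib.request.Request(
            url, headers={"Accept": "application/sparql-results+json"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)["results"]["bindings"]

    # Local: easy to customise with user- or site-specific assertions.
    # Remote: easier to consume, harder to adjust.  (Endpoint URL is invented.)
    # ask_remote("http://vo.example.org/sparql",
    #            "SELECT * WHERE { ?s ?p ?o } LIMIT 5")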

4 Problems

There are of course potential problems with this approach, of which the most severe is the problem of `Chinese whispers' (a children's game in which a whispered message is passed along a line of players, and the initial and final versions compared). If my model and a target (foreign) one are connected through multiple links, then there is a danger that the set of equivalences can become corrupted. This is unlikely to be a significant problem in practice: most links would require two hops (in and out of a common model), and more than three would be rare; a reasoner would report its reasons for making a link between two models, allowing implausibly long chains to be discarded post-hoc; finally, an isEquivalent relation would be documented as explicitly permitting such transitivity, and be complemented by other relations which do not.
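
The hop-counting part of this is straightforward to make mechanical. The following sketch (plain Python, with the equivalence links written as a toy list rather than read from RDF, and with invented concept names) searches outward from a known concept, records the chain by which each other concept was reached, and lets the caller discard anything connected by an implausibly long chain.

    # Breadth-first search over (symmetric) equivalence links, reporting the
    # chain used to reach each concept so that long chains can be rejected.
    from collections import deque

    # Toy link set: each pair is an asserted equivalence (names invented).
    links = [
        ("my:centreRA", "ivoa:imageCentreRA"),
        ("ivoa:imageCentreRA", "radio:fieldRA"),
        ("radio:fieldRA", "oldsurvey:plateCentre"),
    ]

    def equivalents(start, max_hops=3):
        """Return {concept: chain-of-concepts} for everything within max_hops."""
        neighbours = {}
        for a, b in links:
            neighbours.setdefault(a, set()).add(b)
            neighbours.setdefault(b, set()).add(a)
        found = {start: [start]}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if len(found[node]) - 1 >= max_hops:
                continue        # do not extend chains beyond the hop limit
            for nxt in neighbours.get(node, ()):
                if nxt not in found:
                    found[nxt] = found[node] + [nxt]
                    queue.append(nxt)
        return found

    for concept, chain in equivalents("my:centreRA", max_hops=2).items():
        print(len(chain) - 1, "hops:", " -> ".join(chain))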

Another major worry is that such a laissez-faire approach would create chaos as data providers, freed from the obligation to transform their products into an instance of a consensus data model, make them available in their native form alone, undoing the interoperability gains that have been so painfully achieved. This may be less of a problem than it appears -- see the notes below on the comparison between this approach and the consensus one.

5 Comparisons with the consensus approach

This is less different from the consensus model approach than it might seem at first. To read a foreign data model using the consensus approach requires two translations, in and out of the consensus model (of course, the first will have been done by the data provider before the consumer sees it), using rather simple logic of the sort provided by OWL (`if A is not available, use B'). The semantic approach described in this note will probably most commonly gain access to foreign models by an exactly analogous two-hop route, mapping a local concept to a foreign one via a concept in a well-known model (perhaps one of the consensus models which will emerge from the IVOA process). Thus the `Chinese whispers' danger is no more severe in this two-hop case than in the consensus model approach. From this point of view it is reasonable to make the interoperability demand that datasets made available to the VO must have some declared mapping to a given model, instead of the current demand that they be exposed pre-converted, as an instance of the consensus model (when it emerges).

The differences between the approaches are as follows.

In the semantic approach, the reading application's data access is directly to the original dataset, be it in a database, FITS file, or XML file. The application must therefore be capable of reading from, or writing to, that format. What the interested application gains from the semantic mapping is the information about which column, or extension, or XPath it needs to access.
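
For example, once the reasoning step has identified the relevant column, the application fetches the data through its ordinary access machinery for that kind of source. The sketch below shows the database case using Python's built-in sqlite3 module; the file, table and column names are invented, and stand in for whatever the mapping actually returned.

    # Using the answer from the reasoning step ('column three', here called
    # POINT_RA) to fetch data directly from the provider's own table.
    # File, table and column names are invented for the example.
    import sqlite3

    column = "POINT_RA"      # obtained from the semantic mapping, not hard-coded
    conn = sqlite3.connect("observations.db")
    rows = conn.execute(f"SELECT {column} FROM observations").fetchall()
    conn.close()

    print(f"{len(rows)} values read from column {column}")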

The consensus approach is essentially limited to this two-hop translation. The semantic approach, however, is more flexible.

It might happen that an application and a data provider will share a concept which is not, or not well, represented in the consensus model. In this case the mapping might be made via an alternative intermediate model; it might be made in more than two hops, by a mapping being made to a model which itself is indirectly mapped to the target one; or in some specialised circumstances it might even be best made directly, via an explicit one-hop mapping between the two models.

Because this provides a principled way of using a broader range of models, we no longer have to rely on the consensus data model providing all our interoperability needs. Freed from this responsibility, the data modelling process can move faster, towards a simpler and more comfortable consensus, which provides the core of common semantics without incurring the non-linear increase in costs which comes with each incremental increase in its scope.

Overall, the result is that individuals, from data providers to application authors, are able to choose which of several data models are most valuable to them, and to use more than one where appropriate, without sacrificing interoperability.

6 Practicalities

There already exist systems which can handle large RDF databases, and RDF is merely the most current and fashionable of a lineage of technologies including Conceptual Graphs, Description Logics and Topic Maps, all of which have working implementations. There are, further, existing systems which can do the inferencing required -- these are part of a long lineage of AI systems. However, such systems are not currently easy to build into applications.

I want to embark on exploratory work along these lines.

That is, I aim to get some sense of just how practical this would be, by addressing the question of how many separate DMs the community can sustain (a few? many?), and how many would be sufficient.

7 History

$Log: intelligent-data.xml,v $
Revision 1.8  2007/09/10 07:29:48  norman
Installation and formatting tweaks

Revision 1.7  2007/07/08 21:16:20  norman
Replace myxml.lx with similar, but a bit more lightweight,
  structure.lx.  This doesn't use the ng: namespace, and so avoids
  (inter alia) all the fuss about .rnc files.
The new structure.lx outputs stuff with RDFa annotations, which works
  with http://ns.inria.fr/grddl/rdfa/2007/05/25/RDFa2RDFXML.xsl at
  least.  It needs a version of libxslt later than 1.1.11, though
  (that version -- which is the system version for OS X 10.4.10 --
  gets confused and emits the wrong RDF namespace).
Adapted the four documents here to use this.  The ones in ../* I
  haven't touched.

Revision 1.6  2006/04/03 15:11:27  norman
Fix home page link

Revision 1.5  2005/11/02 21:52:22  norman
Changed xmlns:ng namespace

Revision 1.4  2004/12/17 16:58:31  norman
A couple of clarifications, after discussion with Arthur Stutt

Revision 1.3  2004/12/08 18:21:51  norman
Fix typos.  A couple of minor rewordings.

Revision 1.2  2004/12/06 14:59:20  norman
Tidyups -- first public version


Norman
2007/09/10 07:29:48