UType questions

The IVOA is currently struggling with questions of utype syntax – how long the utypes should be, how general, what shape of brackets should they have, and so on.

It strikes me that these syntactic questions are second-order ones, secondary to more fundamental questions about what utypes are. I’d like to ask these questions here.

I am often confused about what utypes are and are for, and when I talk to people who claim to be less confused, they give a variety of very certain and confident answers, which are rarely compatible with each other. These incompatibilities are not merely differences of opinion about syntax, but appear to represent fundamentally different perceptions of how utypes are scoped and designed. Those implicit incompatibilities could seriously frustrate a utype recommendation process and the participants in it. Bluntly put, if these incompatibilities are not addressed, then utypes are broken before they are even specified.

I should emphasise that I don’t think the questions below are necessarily hard ones, and I’m sure we all have some answers to them. However, I’m sure that we all have multiple answers – possibly contradictory answers – and any utype recommendation would have to ask these questions, and make categorically explicit choices.

I imagine people will be able to guess my own answers to some of these questions, but I have deliberately avoided making these suggestions explicit below, and hope I have avoided prejudging anything in the questions and alternatives I provide.

[Update: I’ve provided my own answers to some of these questions in a separate set of proposals; see also the dm@ivoa.net list discussions.]

First: Does any of the following matter?

In any single case – such as an application parsing an SSA response – none of this matters. If there were only one standard using utypes, then utypes would be essentially a serialisation technique private to that standard (and that would work fine). But people want to use utypes in an increasingly diverse range of cases, including XML of various types, ADQL queries and responses, possibly FITS files, and surely others we haven’t thought of. In that context I would be very surprised if ‘don’t care’ or ‘doesn’t matter’ is a good answer to any or all of the potential problems below.

1 Definition

How are utypes defined? As of 2009 December, there is a sequence of draft documents defining utypes, edited by Mireille Louys, available on the IVOA wiki. Other documents which assert or attempt a definition of utypes are the SSA specification (section 2.11), the VOTable specification, and Jonathan McDowell’s UType note; François Bonnarel has also made an explicit proposal for parameterised utypes, and the proposal that utypes should be closely or loosely modelled on XPath has been made several times. I don't want to summarise the history of utypes here, nor survey the range of definitions.

None of these documents, I believe, implies answers to all of the questions below. In particular, the documents do not adequately answer the question ‘what is a utype?’, since phrases such as ‘a field of a data model’, ‘data model field’, or ‘[a concept’s] role in a given data model’ do not define utypes with the precision appropriate for a necessarily rather abstract concept. Furthermore, the documents define the meaning and syntax of utypes simultaneously, with the result that apparently syntactical problems are identified as conceptual limitations of utypes (see 4.1 The Uniqueness Problem and UFIs).

2 Equality

What is the equality function for utypes?

That is, given two utypes A and B, how do you determine whether A == B?

Specifically, if you are an application which can handle A, how do you recognise a utype B in, for example, an incoming VOTable?

If your answer is string-eq?(A, B), this is a simple test, but you acquire other problems. If you distinguish Char’n utypes from others with the explicit leading string cha: or Characterization. (for example), then how do you distinguish v1.0 from v1.1 utypes? If the answer to that is ‘from the context’, then what you have done is state that the equality function is actually

string-eq?(A, B) && Context(A) == Context(B)

which means that you have to define Context(A) and Context(B) in some out-of-band way, or leave it implicit/vague/unspecified/cross-your-fingers. Or you dismiss it as someone else’s problem (the application author’s?). Or you assert that the context will not matter (which is perhaps a good answer, I don’t know).
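As a minimal sketch of what this context-dependent equality amounts to, assume that each utype is carried around together with some context label. The `ScopedUtype` type and the labels `Characterisation-1.0` and `Characterisation-1.1` are hypothetical, invented here purely for illustration; nothing in the existing documents defines them.

```python
from typing import NamedTuple


class ScopedUtype(NamedTuple):
    """A utype string together with the context it was found in."""
    text: str      # the utype exactly as written, e.g. "cha:SpatialAxis"
    context: str   # hypothetical context label, e.g. a data model version


def utype_equal(a: ScopedUtype, b: ScopedUtype) -> bool:
    # string-eq?(A, B) && Context(A) == Context(B)
    return a.text == b.text and a.context == b.context


v10 = ScopedUtype("cha:SpatialAxis", "Characterisation-1.0")
v11 = ScopedUtype("cha:SpatialAxis", "Characterisation-1.1")

print(v10.text == v11.text)   # True: the bare strings cannot be told apart
print(utype_equal(v10, v11))  # False: the context makes the difference
```

The sketch only works because something has supplied the `context` field; where that value comes from is exactly the out-of-band question raised above.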

The alternative is to define utype equality using the same mechanism as element (more precisely QName) equality in XML (this does not, of course, mean that we’re talking about XML, merely that we are stealing its concrete algorithm). That is, break the utype into pfx:local (or some equivalent syntax), map pfx to namespace, and let the equality function be:

string-eq?(localA, localB) && string-eq?(namespaceA, namespaceB)

independent of the prefix string. Equivalently, this could be

string-eq?(namespaceA+localA, namespaceB+localB)

(with ‘+’ representing string concatenation). This starts to look very much like the compact URIs which have been described and used in various W3C Recommendations.
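A minimal sketch of this namespace-based equality might look like the following. It assumes a pfx:local syntax and a per-document table mapping prefixes to namespaces; the `expand` and `utype_equal` functions and the example.org namespace URIs are hypothetical, chosen only to illustrate the algorithm.

```python
def expand(utype: str, prefixes: dict[str, str]) -> str:
    """Expand 'pfx:local' to namespace + local, using a prefix table."""
    prefix, sep, local = utype.partition(":")
    if not sep:
        raise ValueError(f"no prefix in utype {utype!r}")
    return prefixes[prefix] + local


def utype_equal(a: str, a_prefixes: dict[str, str],
                b: str, b_prefixes: dict[str, str]) -> bool:
    # string-eq?(namespaceA+localA, namespaceB+localB)
    return expand(a, a_prefixes) == expand(b, b_prefixes)


# Hypothetical namespace URIs; the two documents bind prefixes differently.
doc1 = {"cha": "http://example.org/characterisation/1.0#"}
doc2 = {"c": "http://example.org/characterisation/1.0#",
        "cha": "http://example.org/characterisation/1.1#"}

print(utype_equal("cha:SpatialAxis", doc1, "c:SpatialAxis", doc2))    # True
print(utype_equal("cha:SpatialAxis", doc1, "cha:SpatialAxis", doc2))  # False
```

In the first comparison the prefixes differ but the namespaces agree, so the utypes are equal; in the second the prefixes are identical but are bound to different namespaces, so they are not.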

There will surely be multiple other syntactic alternatives.

3 Finding documentation

If someone comes across a VOTable with utypes in it, how do they make sense of it?

One answer is to expect that person to put the utype into google and hope that they arrive at ivoa.net, and can work out which document and DM version is the relevant one. That’s nice and simple, and google is magic. This is broadly similar to the FITS keyword problem, and people more-or-less managed to cope, there.

Another answer is to say ‘from the context’ again. The person looking at the utype knows, presumably, where the VOTable came from, and can go and look for documentation there. This is the FITS solution.

A final alternative is to associate a dereferenceable URL with the Context(A), and the most obvious way to do that is to use that URL to denote a namespace, using whatever mechanism is syntactically appropriate to the context. This requires an extra step when publishing a set of utypes, but since we would presumably require utypes to be documented online anyway, making the namespace URI dereferenceable (which is regarded as good practice anyway) is a minor extra detail.

4 Scope

Is there a collection of anticipated uses of utypes?

I’m aware of explicit examples in SSA (obviously) and VOTable; I’ve heard of but not personally seen use within ADQL; Jonathan McDowell has illustrated use within a FITS serialisation; and multiple utypes are listed in the characterisation data model, with an explicit set of utypes proposed in Utype list for the Characterisation Data Model.

Are there things which are explicitly out of scope? Are there explicit problems which utypes must address?

We can imagine a spectrum of expressiveness for utypes, ranging from utypes as little more than rebooted UCDs, to utypes as types or properties (see below), to utypes as little languages containing self-contained machine-manipulable descriptions. I can't see any reason why more expressiveness would not be good (ie, more expressiveness equals more benefit); however more expressiveness would almost certainly be associated with more cost. That cost would arise (i) from the specification process, if a large degree of consensus were required up-front; (ii) from syntax, if utypes require parsing; and (iii) from conceptual dissonance if utypes are not defined in a principled way. As with any specification process, I imagine the aim is to have as much expressiveness as possible for a cost we are collectively willing to bear.

One illustration of the apparent absence of a scoping argument is the boundary between utypes and UFIs. This brings us to the uniqueness problem.

4.1 The Uniqueness Problem and UFIs

What is the uniqueness problem – sorry, the Uniqueness Problem? I’ve seen, and had described to me, illustrations of it, but I don’t believe that utypes are so far sufficiently precisely defined to support a formal articulation of the problem; indeed, I suspect that the uniqueness problem cannot be described carefully until utypes are first described in detail.

For example, in XPath-like syntaxes, the uniqueness problem appears as the statement that ‘the utype XPath selects more than one element’. Whatever the advantages or disadvantages of that as a syntax, it at least admits this usefully concrete statement of the problem. It also suggests that it is just a syntax problem, and might not be present in a different syntax. Is this so, or is it a more fundamental problem? The answer to the question ‘what is a utype?’ has a bearing on this.
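To make the XPath form of the problem concrete, here is a small sketch using Python’s standard ElementTree module. The XML document and the path `Axis/name` are made up for this illustration and are not taken from any IVOA schema or utype list.

```python
import xml.etree.ElementTree as ET

# A made-up serialisation with two axes; not taken from any IVOA schema.
doc = ET.fromstring("""
<Characterisation>
  <Axis><name>spatial</name></Axis>
  <Axis><name>spectral</name></Axis>
</Characterisation>
""")

path = "Axis/name"            # hypothetical XPath-like utype
matches = doc.findall(path)   # every element the path selects

print(len(matches))                # 2
print([m.text for m in matches])   # ['spatial', 'spectral']
```

The path selects two elements, so a utype spelled this way does not pick out a single value in this document.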

5 What things have utypes?

In the context of VOTables, the only things which have utypes, as far as I can see, are the numbers or strings appearing in the <td> elements, and I’ve been told that utypes apply only to such literals. However I’ve been told with equal confidence that in ADQL utypes can appear in the WHERE clause referring to structured things such as STC-S strings.

If the answer is ‘only literals can have utypes’ (where literals means things like int, float, string, or perhaps the XSD datatypes), then it sounds like this makes things hard for ADQL.

If the answer is ‘anything can have utypes’, leaving the decision to the recommendation document which specifies the list of allowed utypes, then that might create problems in some cases. For example, a serialisation to a FITS table (see Jonathan’s example) might require the values to be FITS types.

Perhaps the answer is ‘anything can have a utype if it can be serialised as a string’ (ideally one with no more than 70 characters!). But in that case how do you actually indicate that serialisation notation, and isn’t this starting to sound rather arbitrary?

6 What is a utype?

Consider a VOTable which uses utypes, such as this fragment from the Characterisation specification:

  <TABLE utype="cha:SpatialAxis">
    <DESCRIPTION>Spatial characterization</DESCRIPTION>
    <FIELD ID="Na" name="Name" arraysize="*"
      datatype="char"
      utype="cha:SpatialAxis.axisName"/>
    ...

or, more generically,

<field name='xxx' utype='cha:foo'/>

What, precisely, is this supposed to mean? Are the utypes here properties or types?

For the sake of clarity, as I am using the terms here a property (or sometimes predicate) is a binary relation between a subject which possesses that property, and an object which is the property’s value. A type, on the other hand, is a unary relation, annotating a value. We can also distinguish between semantic types, which indicate meaning, and lexical types, which indicate how a sequence of bytes is to be parsed (is 123 intended to be the string ‘123’ or the number 123 or possibly even a julian day number?). We can generally, I am sure, ignore lexical types in a utype discussion, since these are a lower-level detail of the serialisation format, such as VOTable, VOEvent XML or FITS.

6.1 Are utypes properties?

One very natural way to read this VOTable fragment is that there is a thing which has as its cha:SpatialAxis the structured object represented by this table; that table in turn has as its cha:SpatialAxis.axisName the value which is in the Name field. That is, we can, it seems, regard utypes as properties, and the VOTable specification (implicitly) indicates how to find the value of the property. In this case, it appears that the value of the cha:SpatialAxis is the table it is attached to, and the value of the cha:SpatialAxis.axisName is found in the field or column which the <FIELD> element describes.

In this view, the datatype attribute indicates the lexical type of the value, as distinct from the semantic type – that is, the bit of information that says that the name ‘Intensity’ shouldn’t be assigned to a string-valued observer field. In this picture, the (semantic) type of the cha:SpatialAxis.axisName property isn’t specified, but is presumably deducible from the property name.

Jonathan’s example FITS serialisation (see 4 Scope) is a set of key-value pairs, where the utypes are the keys, and the values are restricted to the set of FITS (lexical) types. That makes the utypes look a lot like properties. We’ve seen multiple examples of XML serialisations, and utypes built up from element names: the most natural way of interpreting these is that element names are properties, and their values have meanings and syntactic forms described by XSchema types. Similarly, if utypes are built up from the property arcs in a UML diagram, then it seems slightly perverse for the resulting utype to be called a type, and not a property.
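Under the property reading, a set of utype-keyed pairs can be reassembled mechanically. The following is only a sketch: the `assemble` function is hypothetical, it assumes that dots in a utype separate successive property names, and the key `SpatialAxis.unit` and both values are invented for the example.

```python
def assemble(pairs: list[tuple[str, str]]) -> dict:
    """Build a nested dict from (utype, value) pairs, one level per dot."""
    root: dict = {}
    for utype, value in pairs:
        *parents, leaf = utype.split(".")
        node = root
        for name in parents:
            # Follow (or create) the property leading to the next level down.
            node = node.setdefault(name, {})
        node[leaf] = value
    return root


# Hypothetical key-value pairs in the style of a FITS header.
# 'SpatialAxis.unit' and both values are invented for illustration.
pairs = [
    ("SpatialAxis.axisName", "pos"),
    ("SpatialAxis.unit", "deg"),
]
print(assemble(pairs))
# {'SpatialAxis': {'axisName': 'pos', 'unit': 'deg'}}
```

Each key names the chain of properties leading to a value, so no information beyond the keys themselves is needed to rebuild the structure.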

6.2 Are utypes types?

In the VOTable example above, is cha:SpatialAxis.axisName or cha:foo a type, which indicates, effectively, how to interpret the contents of the relevant column or XML element? In that case: what is the property which has a value which has that type; which is the thing which possesses this property; and how are these typed objects to be assembled into a useful description?

A more concrete way of putting this is: is a utype a key which tells you how to deserialise a value bearing that utype (in which case it’s a type, and the uniqueness problem is something or other to do with this deserialiser firing more than once), or is it one of a number of bits of information which an application would have to assemble into a structured data object (in which case it’s probably a property, and uniqueness is concerned with whether or not a thing can have more than one property with a particular name)?

If utypes are types, is the utype definition expected to include the units the value has, and even the lexical form of the value, or is this the role of the datatype and unit VOTable attributes? Is it therefore fair to say that datatype and unit are to do with the lexical representation of a value, and the utype is the post-parse, or serialisation-independent, meaning of the item? If so, is it right that an RA of 15deg and one of 0.26rad have the same utype and datatype (a number) but different units, and that an RA of 01:00:00 has the same utype but a different datatype? Should a utype for a description, for example, have anything to say about Unicode?
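The RA example can be made concrete with a short sketch in which three lexical forms are reduced to one serialisation-independent value. The `ra_degrees` function and the unit labels `deg`, `rad` and `h:m:s` are assumptions made for this illustration; they are not VOTable unit strings or datatype names.

```python
import math


def ra_degrees(text: str, unit: str) -> float:
    """Parse one lexical form of a right ascension and return degrees."""
    if unit == "deg":
        return float(text)
    if unit == "rad":
        return math.degrees(float(text))
    if unit == "h:m:s":   # sexagesimal hours, one hour being 15 degrees
        h, m, s = (float(part) for part in text.split(":"))
        return 15.0 * (h + m / 60.0 + s / 3600.0)
    raise ValueError(f"unknown unit {unit!r}")


for text, unit in [("15", "deg"), ("0.26", "rad"), ("01:00:00", "h:m:s")]:
    print(f"{text:>8} {unit:<6} -> {ra_degrees(text, unit):.2f} deg")
```

The first and third forms give exactly 15 degrees, and the second differs only because 0.26 is a rounded value. On this reading the utype would attach to the result, and the unit and datatype to the parsing step.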

That sounds overly fussy, but it makes a difference to whether the FITS serialisation has a float in the value position or a string, and if utypes are to be applicable to STC-S strings, we’re forced to ask whether the definition of a position utype is expected to acknowledge the 70-character limit. If utypes are properties, then it’s natural and tidy for a utype’s definition to say nothing about the value’s units and lexical form (devolving such questions to the definition of a specific serialisation); if utypes are types, then the boundary between the utype’s definition and the serialisation’s definition becomes blurrier.

In the FITS example, it’s not impossible to say that the utypes are types, rather than properties, and the serialisation consists of a set of type-value pairs where the types are sufficiently precisely defined that there is only one way they will fit together, jigsaw-like, to indicate a deserialised object. Or in the SpatialAxis example at the beginning of this section, if cha:SpatialAxis and cha:SpatialAxis.axisName are to be regarded as types, then what this VOTable fragment is saying is that there exists a thing with type cha:SpatialAxis, and a thing with type cha:SpatialAxis.axisName, so we can rely on an application knowing there is only one way these can slot together, and so being able to reassemble the whole structure. This underspecification is not obvious in this VOTable example, because the XML containment provides an implicit relationship between the object of type cha:SpatialAxis and the object described by the thing of type cha:SpatialAxis.axisName, but this is to some extent a coincidence of the serialisation format, and not something that would naturally fit into a utype specification.

In this picture, we deduce the properties which possess values of these types. That’s potentially workable, but it does feel a little back-to-front, and it surely leads to some sort of complicated and unanalysable uniqueness problem in those cases where the pieces can fit into the jigsaw in more than one way.

In this view, there's really only one property defined, hasA, and the structured information implied by the UML diagrams in, for example, the Characterisation data model must be deduced by reasoning that the only thing that potentially hasA ChAxis.accuracy.statError.flavor is a ChAxis.accuracy.statError, and as long as there’s only one of them, we can tie these two bits of information together. Which spec has the responsibility of articulating these relationships, and how are they expressed? Or is this the responsibility of a serialisation specification, on inspection of a UML diagram? Is this the uniqueness problem?
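The jigsaw reasoning can be sketched as follows, assuming (and it is only an assumption) that the parent of a typed item is found by dropping the last dotted component of its utype. The `infer_links` function is hypothetical; it shows one way the inference could be done and where it breaks down.

```python
from collections import Counter


def infer_links(utypes: list[str]) -> dict[str, str]:
    """Link each typed item to its parent type, or fail if ambiguous."""
    counts = Counter(utypes)
    links = {}
    for utype in counts:
        parent, dot, _leaf = utype.rpartition(".")
        if not dot or parent not in counts:
            continue   # no candidate parent is present, so nothing to attach to
        if counts[parent] > 1:
            raise ValueError(f"{utype} could attach to {counts[parent]} "
                             f"instances of {parent}")
        links[utype] = parent
    return links


one = ["ChAxis.accuracy.statError", "ChAxis.accuracy.statError.flavor"]
print(infer_links(one))

two = one + ["ChAxis.accuracy.statError"]   # a second statError appears
try:
    infer_links(two)
except ValueError as err:
    print("ambiguous:", err)
```

With a single statError the flavor can be attached without difficulty; as soon as a second statError is present the same inference has no unique answer.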

Perhaps the answer is ‘a utype X is a property X which has type X’ (both at once). Perhaps there are both properties and types here. Perhaps the answer is ‘sometimes it’s a type and sometimes it’s a property’. Doesn't that sound like a recipe for trouble?

7 and finally...

And is it utype, Utype, UType, or UTYPE?

Appendices

Notes

Acknowledgements

Thanks to Mireille Louys for comments on an earlier version of this note.

Document history

2009 December 3
Add references to utype drafts on ivoa.net wiki, and a forward link to my own set of proposals.
2009 May 12
Amended and clarified, after comments from Mireille
Norman
2009-12-04 17:39 +0000