UType proposals

I present a proposal for structuring and justifying UTypes, which is intended to complement the 2009 May 24 Louys draft-0.3 proposal.

On 2009 May 24, Mireille Louys circulated a preliminary draft proposal for Utypes, for comments within the DM working group. This document is a response to that.

[Update: Mireille has since produced a further 0.4 draft, also available on the IVOA wiki. The proposal on this page is to some extent a response to a list of utype questions I produced some while ago; detailed correspondences are discussed below in Sect. 5 Answers to Questions; see also the discussion on the dm@ivoa.net list.]

This document is not intended to be a counterproposal. I believe it is at heart the same proposal as Mireille's, but arrived at from a rather different direction, and so justified in a different way. The principal syntactic difference is that the model-name Utype elements are here references to a namespace, rather than regarded as the namespace themselves.

For the sake of clarity, I have presented this in the assertive style of a proposed standard; it is in fact more tentative than that, and with more rationale and commentary (presented like this) than would likely be in an actual standard.

1 Background

Goals:

  1. The Utypes for a data model are to be used as keys in a set of key-value pairs, where the values are literals (that is, numbers or strings or the like, and specifically not structured objects). The set of key-value pairs allows an application to reconstruct an instance of the object that the data model describes. Each key may appear at most once.
  2. The Utype specification is to be neutral as to serialisation format.
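The first goal can be illustrated with a minimal sketch, which collects key-value pairs into a mapping while enforcing the two constraints stated there: values are literals, and each key appears at most once. The function name and the placeholder keys are my own inventions for illustration, and are not real UTypes.

```python
from typing import Iterable, Union

# "Literal" here means a number, string or the like, never a structured object.
Literal = Union[str, int, float, bool]


def collect_pairs(pairs: Iterable[tuple[str, Literal]]) -> dict[str, Literal]:
    """Build a key -> literal mapping, rejecting structured values and repeated keys."""
    result: dict[str, Literal] = {}
    for utype, value in pairs:
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"value for {utype!r} is not a literal: {value!r}")
        if utype in result:
            raise ValueError(f"key {utype!r} appears more than once")
        result[utype] = value
    return result


# Placeholder keys, standing in for whatever UTypes the data model defines.
axis = collect_pairs([("axis-name", "space"), ("number-of-bins", 16)])
```

An application would then rebuild its in-memory object from `axis`, without needing to know which serialisation the pairs came from.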

Consider the XML fragment below:

<characterisationAxis>
  <axisName>space</axisName>
  <numBins>16</numBins>
</characterisationAxis>

Consider also the FITS card below

NUM-BINS= 16

Do these two fragments say the same thing?

Yes, they do. We can tell this because in each case we can (mentally) draw a picture which looks like this:

[Figure: number of bins property]

In each case, we can clearly define a rule for mechanically going from that picture (or, more concretely, from a set of structures in memory, or whatever it is which is the post-parse result of reading these files) to each of the given serialisations, and we can define a rule for recovering that picture from the serialisation.

The property number-of-bins expresses the relationship between the thing which has this number of bins and the number itself. It is an element of a data model. The string number-of-bins is obviously a poor name, since it gives no indication of which data model the property comes from, nor where to find more information. It is important to specify what features a better name would have, and the resulting well-featured name is termed a UType.

This approach – identifying the abstract property name as the interface between a data model and a serialisation – allows us to draw a clear boundary between the various parts of the problem, since it becomes clear that:

  1. the job of a serialisation is simply to transform back and forth between the serialisation and the abstract form;
  2. the serialisations can be seen to be completely independent of each other, and independent of the form eventually chosen for the UType name; and so
  3. as long as it possesses the general good features we wish a UType to have, the form of the UType is effectively arbitrary, and there is therefore no need to specify a means for its construction.

Put another way, identifying the picture above as an interface specification means that the details of the serialisation, and the algorithm for constructing the UType names, become mere implementation details.
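As a rough sketch of this separation, the two fragments from the start of this section can each be reduced to the same abstract form by an independent pair of rules. The function names are hypothetical, the key is the placeholder name number-of-bins rather than a real UType, and the FITS handling covers only the abbreviated card shown above, not a full 80-character card image.

```python
import xml.etree.ElementTree as ET

# Placeholder property name from the text; a real UType would replace it.
NUMBER_OF_BINS = "number-of-bins"

XML_FRAGMENT = """<characterisationAxis>
  <axisName>space</axisName>
  <numBins>16</numBins>
</characterisationAxis>"""

FITS_CARD = "NUM-BINS= 16"


def from_xml(fragment: str) -> dict[str, int]:
    """Recover the abstract property from the XML serialisation."""
    root = ET.fromstring(fragment)
    return {NUMBER_OF_BINS: int(root.findtext("numBins"))}


def from_fits_card(card: str) -> dict[str, int]:
    """Recover the abstract property from the (abbreviated) FITS card."""
    keyword, _, rest = card.partition("=")
    if keyword.strip() != "NUM-BINS":
        raise ValueError(f"unexpected keyword: {keyword!r}")
    value = rest.split("/", 1)[0].strip()  # drop any trailing comment
    return {NUMBER_OF_BINS: int(value)}


def to_fits_card(abstract: dict[str, int]) -> str:
    """Go the other way: from the abstract form to the FITS serialisation."""
    return f"NUM-BINS= {abstract[NUMBER_OF_BINS]}"
```

Neither parser knows anything about the other, and neither depends on the form of the property name; only the abstract mapping is shared.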

2 UType names

What are the features which the property name must have?

  1. It must be possible to separately name each element of each version of a data model. This is surely fundamental, since this level of unambiguousness is the point of UTypes. And unless the modellers can guarantee to get their model right first time, there will need to be a version 1.1.
  2. It should be easy to find human-readable, and machine-readable, documentation for the UTypes. If the UType is purely a string, with few or no restrictions on its syntax, then the only way to find out information about an unfamiliar UType is either to look it up in some registry (which requires that such a registry be set up, and that UType makers do consistently register UTypes), or google it. While the latter would probably work in many cases, having google be a crucial part of one's infrastructure seems poor design.
  3. In normal use, it must be reasonable for an application to treat the UTypes as opaque strings, once they have been deserialised. If we were to require that applications parse the UTypes to obtain essential information, we would be placing an uncomfortable burden on application authors, and a burden on the authors of the standard, who would have to invent, specify and debug a syntax. This doesn't mean that UTypes mustn't be parseable or otherwise processable, but any information obtained this way should be extra or rarely required.
  4. UTypes should be reasonably readable by a developer. In the design presented here, there is no technical reason why a UType should be readable – it is not parsed, and can be treated by the application as if it were opaque. However there is little reason for it not to be readable, and it is frequently convenient for a developer, or someone reading a raw serialised file, to be able to understand it without constantly looking up otherwise meaningless strings.
  5. It is not a requirement that the UTypes be short. There would appear to be some value to the UTypes being short enough to be easily presented in an interface, but this is to conflate the UType itself with its display label.

We note that a URL meets all of these criteria, and propose that UTypes should be dereferenceable URLs. For example, one might imagine a Characterisation UType such as:

http://www.ivoa.net/Documents/Characterisation-1.13.html#Char.SpatialAxis.NumBins

This might appear in a specific serialisation as:

char:Char.SpatialAxis.NumBins

if there were some separate syntactic mechanism, appropriate to that serialisation, for associating the string char: with the URL http://www.ivoa.net/Documents/Characterisation-1.13.html#.
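One way such a mechanism might work is sketched below, assuming the serialisation supplies a table of prefixes. The table and the function names are illustrative assumptions; only the char: prefix and its URL come from the example above.

```python
# Prefix table that a serialisation might declare; illustrative only.
PREFIXES = {
    "char": "http://www.ivoa.net/Documents/Characterisation-1.13.html#",
}


def expand(abbreviated: str, prefixes: dict[str, str]) -> str:
    """Turn 'prefix:local' into the full UType URL."""
    prefix, separator, local = abbreviated.partition(":")
    if not separator or prefix not in prefixes:
        raise ValueError(f"no namespace declared for {abbreviated!r}")
    return prefixes[prefix] + local


def abbreviate(utype: str, prefixes: dict[str, str]) -> str:
    """Turn a full UType URL into 'prefix:local' where a prefix is declared."""
    for prefix, namespace in prefixes.items():
        if utype.startswith(namespace):
            return f"{prefix}:{utype[len(namespace):]}"
    return utype  # no matching prefix: leave the URL as it is


full = expand("char:Char.SpatialAxis.NumBins", PREFIXES)
```

The expansion happens inside the deserialiser, so the application only ever sees the full URL in `full`.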

In use, an application would not have to parse this UType, and could regard it as a completely opaque string. Since an application only deals with the post-parse result of a deserialisation, the serialisation technique has absolute freedom to transform this UType in any way. Thus it would be natural and tidy (but generally not necessary) for an XML serialiser to use XML namespaces when generating its output file, and it would be necessary for a FITS serialisation to perform some transformation to fit keyword-value pairs into an 8+70 character FITS card image.

This has the following advantages.

  1. The combination of the DNS and web server means that namespace uniqueness is trivially available without registration.
  2. If a UType can be dereferenced to give documentation, then users and developers can immediately find authoritative explanations of what a UType is intended to mean.
  3. The design is forwards compatible with reasonably anticipated developments on the world-wide web.

The requirement that UTypes be dereferenceable does not mean that software would be expected to dereference them frequently. Since the content of the retrieved information would generally be static, being the result of a standardisation process, it could be very aggressively cached, and might for example only need to be retrieved during an application build process.
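A minimal sketch of such caching follows, assuming the documentation is fetched over plain HTTP. The class and function names are hypothetical. The sketch relies on the fact that the fragment after # is not sent to the server, so every UType defined in one document can share a single retrieval.

```python
from typing import Callable
from urllib.parse import urldefrag
from urllib.request import urlopen


def fetch_over_http(url: str) -> bytes:
    """Default retrieval; any callable taking a URL and returning bytes will do."""
    with urlopen(url) as response:
        return response.read()


class DocumentationCache:
    """Cache one copy of each defining document, keyed by URL without fragment."""

    def __init__(self, fetch: Callable[[str], bytes] = fetch_over_http) -> None:
        self._fetch = fetch
        self._documents: dict[str, bytes] = {}

    def documentation_for(self, utype: str) -> bytes:
        # Split off '#Char.SpatialAxis.NumBins' and the like before fetching.
        document_url, _fragment = urldefrag(utype)
        if document_url not in self._documents:
            self._documents[document_url] = self._fetch(document_url)
        return self._documents[document_url]
```

A build process could fill such a cache once and ship its contents with the application, so that nothing is retrieved at run time.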

3 Composite UTypes

Consider now a slightly more complicated case, which has elements from both the Characterisation and STC namespaces:

<characterisationAxis>
  <axisName>spatial</axisName>
  <!-- ... -->
  <coverage>
    <location>
      <coord coord_system_id="TT-ICRS-TOPO">
        <stc:Position2D>
          <!-- ... -->
          <stc:value2>
            <stc:C1>132.4210</stc:C1>
            <stc:C2>12.1232</stc:C2>
          </stc:value2>
        </stc:Position2D>
      </coord>
    </location>
  </coverage>
</characterisationAxis>

How do we picture this? One obvious possibility is:

[Figure: simple representation of positions]

A suitable UType for the coord-value-c1 property might be (following Mireille Louys's draft) char:coverage.location.coord;stc:Position2D.value2.C1, but this syntax appears slightly arbitrary.

Another way of picturing this case is as follows:

[Figure: indirect representation of positions]

In this picture, the char:coverage.location.coord UType has as its value an object of type stc:Position2D, which in turn possesses properties stc:value2.C1 and stc:value2.C2. At this point we have two choices:

The latter picture gets us to the same point as in Mireille's proposal, but in a way which reflects the relationship with the structured underlying model, and which makes clear how this approach could be extended to more elaborate situations if that were necessary. In this view, a UType is a unique sequence of what we might call proto-UTypes which ends in a literal value.
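The idea of a UType as a sequence of proto-UTypes ending in a literal can be sketched as a walk over a nested structure. The nested dictionary below is my own rendering of the second picture, and the function name is hypothetical; how the resulting sequence is then written as a single string (for example joined with ";") is left to the serialisation.

```python
from typing import Iterator, Union

Literal = Union[str, int, float, bool]
Path = tuple[str, ...]

# The second picture: a coord property whose value is a structured object
# with two literal-valued properties.
position = {
    "char:coverage.location.coord": {
        "stc:value2.C1": 132.4210,
        "stc:value2.C2": 12.1232,
    }
}


def flatten(node: object, path: Path = ()) -> Iterator[tuple[Path, Literal]]:
    """Yield (sequence of proto-UTypes, literal) for every leaf of the structure."""
    if isinstance(node, dict):
        for proto_utype, child in node.items():
            yield from flatten(child, path + (proto_utype,))
    else:
        yield path, node


pairs = dict(flatten(position))
```

Each key in `pairs` is unique and ends at a literal, so the result is again a flat set of key-value pairs of the kind described in the Background section.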

A FITS analogue might use FITS-WCS keywords to serialise this abstract structure, or might serialise it using a more direct keyword-value technique, using the same UTypes.

4 Requirements when publishing a set of UTypes

A UType must be dereferenceable to give HTML documentation. See section 4.1 Human-readable documentation.

A UType must be dereferenceable to give machine-readable information. See section 4.2 Machine-readable documentation.

UType publishers should choose a long-term stable URL for their namespace. The natural domain for this is some location under www.ivoa.net, identified as part of the data model standardisation process. The standard document's URL is an obvious first choice.

Given that UML is becoming a popular data modelling language, this standard should publish a set of XSLT scripts which transform an XMI file into HTML and XML or RDF, to help UType publishers. These scripts could embody recommended practice for how to generate UTypes from UML model entities.

4.1 Human-readable documentation

A set of UTypes would be defined by some standard document, published at a URL which will remain stable over a timescale of at least decades. We can expect that these will be www.ivoa.net URLs, without closing the door to future www.iau.org UTypes for example. If each of the UType definitions in that document is associated with an HTML <a name="utype-name"> element, then the requirement for human-readable documentation behind the UType has been immediately and fully met.
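A publisher could check this requirement mechanically. The sketch below, using only the Python standard library, reports which UType names lack a matching <a name="..."> element in a document. The sample HTML and the second UType name are invented for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the values of every <a name="..."> attribute in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.names: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "name" and value:
                    self.names.add(value)


def missing_anchors(html: str, utype_names: list[str]) -> list[str]:
    """Return the UType names that have no anchor in the given HTML."""
    collector = AnchorCollector()
    collector.feed(html)
    return [name for name in utype_names if name not in collector.names]


# Invented sample of what a standard document might contain.
SAMPLE = """<html><body>
<h3><a name="Char.SpatialAxis.NumBins">NumBins</a></h3>
<p>The number of bins along the spatial axis.</p>
</body></html>"""

# "Char.SpatialAxis.Name" is a hypothetical name with no anchor in SAMPLE.
missing = missing_anchors(SAMPLE, ["Char.SpatialAxis.NumBins", "Char.SpatialAxis.Name"])
```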

4.2 Machine-readable documentation

The requirements for machine-readable documentation need be neither onerous nor exotic. Such documentation would be retrieved from the same URL as the human-readable form, by requesting an appropriate non-HTML MIME type. It is at this point that a preferred label (or display name, or short name) might be declared for the object. As noted above, this would not typically be retrieved by the running application, but only rarely, such as during a software build. It is an open question what form this extra information might take, but XML, RDF and JSON are all defensible possibilities.
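Requesting a non-HTML MIME type amounts to setting the HTTP Accept header. The sketch below only builds such a request and does not send it. The function name is hypothetical, and the default MIME type is an arbitrary choice among the candidates mentioned above, since the format is an open question.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def machine_readable_request(utype: str, mime_type: str = "application/rdf+xml") -> Request:
    """Build (but do not send) a request for the machine-readable form of a UType's document."""
    # The fragment identifies the UType within the document; it is not part
    # of what is requested from the server.
    document_url, _fragment = urldefrag(utype)
    return Request(document_url, headers={"Accept": mime_type})


request = machine_readable_request(
    "http://www.ivoa.net/Documents/Characterisation-1.13.html#Char.SpatialAxis.NumBins"
)
```

Passing `request` to `urllib.request.urlopen` would perform the retrieval, provided the server implements content negotiation for that URL.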

5 Answers to Questions

This proposal provides concrete answers to the recent list of UType questions I asked. Specifically:

Appendices

Notes

Document history

2009 December 4
Add reference to updated utype drafts.
Norman Gray
2009-12-04 17:39 +0000