Unity: multi-syntax unit parsing
================================

This is the unity library, which is able to parse scientific unit specifications using a variety of syntaxes.

  * Version: 1.1-snap.888aa0195ca4
  * Repo: https://heptapod.host/nxg/unity
  * Release: 1.1-snap.888aa0195ca4 (888aa0195ca4, 2022 August 2)

The code is made available under the terms of the 2-clause BSD licence. See the file LICENCE.txt, in the distribution, for the copyright statement and the terms.

The library's canonical URI is <https://purl.org/nxg/dist/unity>. That URI redirects elsewhere, but **you should quote or bookmark this `purl.org` URI, rather than the one it redirects to**, as the permanent name of the library (I have relocated the repository more than once). That home page contains further downloads and formatted documentation.

There is an issue tracker at [heptapod](https://heptapod.host/nxg/unity).

Note that, somewhat unfortunately, the namespace URIs for the Unity subjects are `http://bitbucket.org/nxg/unity/ns/unit#`, `http://bitbucket.org/nxg/unity/ns/syntax#` and `http://bitbucket.org/nxg/unity/ns/schema#` (the library was at one time hosted at bitbucket, but the repository has been migrated away). Although these are of course somewhat arbitrary, it is undesirable that they are still at bitbucket. It would be good to change these, but that probably can't happen before some future version of the software (this is [issue #1](https://heptapod.host/nxg/unity/-/issues/1)).

Goals
-----

The library was written with the following goals:

  * Producing formal grammars of the existing and proposed standard unit syntaxes, for reference purposes.
  * Participating in the [VOUnit standardisation process][vounits] by acting as a locus for experimentation with syntaxes and proposed standards.
  * To that end, discovering edge cases and producing test cases.
  * Producing parsing libraries which are fast and standalone, and so could be conveniently used by other software; this also acts as an implementation of the VOUnits standard.
The distribution is buildable using no extra software beyond a Java and a C compiler. It is also a goal that the distributed Java is source-compatible with Java 1.5 (though this isn't automatically tested, so bug reports are welcome).

It is not a goal for this library to do any processing of the resulting units, such as unit conversion or arithmetic.

Parsing unit strings isn't a deep or particularly interesting problem, but it's more fiddly than one might expect, and so it's useful for it to be done properly once.

A major goal of the VOUnits process is to identify a syntax for unit strings (called 'VOUnits') which is as nearly as possible in the intersection of the various existing standards. The intention is that if file creators target the VOUnits syntax, the resulting string has a chance of being readable by as many other parsers as possible. This isn't completely possible (the OGIP syntax doesn't allow dots as multipliers), but we can get close.

Version 1.0 of the [VOUnits standard][vounits] was approved on 2014 May 23.

Although the library was produced as part of the VOUnits process, its use is not restricted to that syntax, and it should be useful for each of the syntaxes listed below.

Outputs
-------

**Yacc grammars for the three well-known syntaxes, plus a proposed 'VOUnits' grammar**
These are consistent, in the sense that any string which parses in more than one of these grammars means the same thing in each case (ignoring questions of per-syntax valid units). The 'VOUnits' syntax is almost in the intersection of the three, in the sense that anything which conforms to that grammar will parse (and mean the same thing) in the others. The only exception is that the OGIP grammar uses '`*`' for multiplication, and the others accept '`.`', but that could be got around with a fairly simple character substitution. That is, if one writes a unit string out in the VOUnits syntax, then it can be read by almost any parser.

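The character substitution mentioned above might look something like the following. This is only a sketch, and not part of the library: the class and method names are hypothetical, and it assumes that a '`.`' with a digit on both sides is a decimal point inside a numeric factor, while any other '`.`' is a multiplication sign. The sample strings are illustrative and have not been checked against the grammars.

```java
// Hypothetical helper, not part of the unity distribution.
class DotToStarSketch {
    /**
     * Replace multiplication dots with '*', leaving decimal points alone.
     * Assumption: a '.' flanked by digits on both sides (as in "1.663")
     * is a decimal point; every other '.' is treated as multiplication.
     */
    static String dotsToStars(String units) {
        StringBuilder sb = new StringBuilder(units.length());
        for (int i = 0; i < units.length(); i++) {
            char c = units.charAt(i);
            boolean digitBefore = i > 0
                && Character.isDigit(units.charAt(i - 1));
            boolean digitAfter = i + 1 < units.length()
                && Character.isDigit(units.charAt(i + 1));
            if (c == '.' && !(digitBefore && digitAfter)) {
                sb.append('*');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dotsToStars("kg.m**2.s**-2"));     // kg*m**2*s**-2
        System.out.println(dotsToStars("1.663e-1mm.s**-1"));  // 1.663e-1mm*s**-1
    }
}
```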
**Parsers in multiple languages**
This distribution includes parsers in Java and C. Thus the core content here is demonstrably language-agnostic. Python would be an obvious next language.

**Test cases**
There are more than 200 test cases, of which between 130 and 200 apply to each syntax.

**A collection of 'known' units**
There are multiple collections of these in circulation in different libraries, but this library gathers them together and generates per-language lookups of the information. The [VOUnits document][vounits] discusses the various compromises necessary here.

Parsing units and prefixes
--------------------------

The grammars defined by this library do not cover the parsing of unit prefixes since (as it turns out) this cannot be usefully done at this level, and the grammars identify, in the terminal `STRING`, only the combination of prefix+unit. These are subsequently parsed in the following manner:

  1. if the whole string is a 'known unit' then this is the base unit (so 'pixel' is recognised in some of the syntaxes as a unit, and a 'Pa' is a pascal and not a peta-year);
  2. or if the first character in the `STRING` is one of the SI prefixes (or the first two are 'da') and the string is longer than one character (or two, in the case of 'da'), then that's a prefix and the rest of the string is the unit (so 'pixe' would be parsed as a pico-ixe);
  3. or else the whole thing is a unit (so 'wibble' is an unknown unit called the 'wibble', 'm' is the metre and not a milli-nothing, but 'furlong' would be a femto-urlong).

That is, validity checking – checking whether this is an allowed unit, or whether it's allowed to have an SI prefix – happens at a later stage than the parsing of the units string and only on request, since it's essentially an auxiliary parse.
This (a) avoids the cumbersomeness of doing this check earlier, (b) separates the _grammatical_ error of having a star in the wrong place from the stylistic or semantic error of using an inappropriate unit, and (c) retains the freedom to use odd units if someone really wants to.

The library also recognises the binary prefixes (kibi, mebi, and so on) of ISO/IEC 80000-13.

Summary:

  * `pixel` --> 'pixel' in the FITS and OGIP syntaxes, the pico-ixel in CDS
  * `furlong/pixe` --> femto-urlong per pico-ixe
  * `m` --> metre in all syntaxes
  * `mm` --> millimetre
  * `dam` --> dekametre (not the deci-`am`)

Notes
=====

The recognised syntaxes are:

  * **fits**: FITS, 3.0 Sect.4.3 ([W.D. Pence et al., A&A 524, A42, 2010][fits]); v4.0 Sect.4.3 ([FITS standards page][fitsspec]); and further comments in the [FITS WCS IV paper][fitswcsiv].
  * **ogip**: [OGIP memo OGIP/93-001, 1993][ogip].
  * **cds**: [Standards for Astronomical Catalogues, Version 2.0, section 3.2, 2000][cds].
  * **vounits**: The VOUnits syntax. This is a subset of the FITS syntax, specified by the [VOUnits specification][vounits].

The grammars are available in `src/grammar/unity.y`. Note that this file is pre-processed before it is fed into a parser generator, and isn't a valid yacc file as it stands; see the relevant targets in `src/java` and `src/c`.

The grammars are implemented by (at present) two libraries, one in C and one in Java. Each of these generates its parsers directly from the grammars. See `src/c/docs` and `src/java/docs` for documentation.

The Java implementation has, and will probably continue to have, more functionality than the C one. That said, I'm open to suggestions about features that are currently in the Java version that could usefully be ported to the C version.

Each of the implementations supports reading and writing each of the grammars, plus LaTeX output (supported by the LaTeX `siunitx` package).

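As an illustration of the three prefix rules in 'Parsing units and prefixes' above, here is a minimal sketch of the `STRING` splitting step. It is not the library's implementation or API: the class name, the method, the tiny 'known units' set and the list of single-character SI prefixes are all assumptions made for the example (the real known-unit tables are per-syntax, and the binary prefixes are ignored here). It deliberately does no validity checking, in keeping with the separation described above, and it avoids language features newer than Java 1.5.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not part of the unity distribution.
class PrefixSplitSketch {
    // Assumed list of single-character SI prefixes ('da' is handled separately).
    static final String SI_PREFIXES = "yzafpnumcdhkMGTPEZY";

    // Tiny stand-in for a per-syntax table of known units (an assumption).
    static final Set<String> KNOWN =
        new HashSet<String>(Arrays.asList("m", "s", "Pa", "pixel"));

    /** Returns {prefix, unit}; the prefix is "" when none is split off. */
    static String[] split(String s) {
        // Rule 1: a known unit is taken whole ('Pa' is the pascal).
        if (KNOWN.contains(s)) {
            return new String[] {"", s};
        }
        // Rule 2: 'da' is tested before the single characters,
        // so that 'dam' is the dekametre and not the deci-'am'.
        if (s.length() > 2 && s.startsWith("da")) {
            return new String[] {"da", s.substring(2)};
        }
        if (s.length() > 1 && SI_PREFIXES.indexOf(s.charAt(0)) >= 0) {
            return new String[] {s.substring(0, 1), s.substring(1)};
        }
        // Rule 3: otherwise the whole string is a (possibly unknown) unit.
        return new String[] {"", s};
    }

    public static void main(String[] args) {
        String[] samples = {"pixel", "pixe", "furlong", "wibble", "mm", "dam"};
        for (int i = 0; i < samples.length; i++) {
            String[] r = split(samples[i]);
            System.out.println(samples[i] + " -> [" + r[0] + "] [" + r[1] + "]");
        }
    }
}
```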
The main testcases – a set of unit strings and the intended parse results – are in `src/grammar/testcases*.csv`. There are also library-specific unit tests within the source trees.

If you want to experiment with the library, build `src/java/unity.jar` or `src/c/unity`.

Illustrating behaviour with the Java version:

    % java -jar unity.jar -icds -oogip mm2/s
    mm**(2) /s
    % java -jar unity.jar -icds -ofits -v mm/s
    mm s-1
    Checking , in input syntax cds:
    check: all units recognised? yes
    check: all units recommended? yes
    check: all constraints satisfied? yes
    Result: mm (10^-3 Metre) s^-1.0 (Second)^-1.0
    % java -jar unity.jar -ifits -ocds -v merg/s
    merg/s
    Checking , in input syntax fits:
    check: all units recognised? yes
    check: all units recommended? no
    check: all constraints satisfied? no
    Result: merg (10^-3 Erg) s^-1.0 (Second)^-1.0
    % java -jar unity.jar -icds -ofits -v merg/s
    merg s-1
    Checking , in input syntax cds:
    check: all units recognised? no
    check: all units recommended? no
    check: all constraints satisfied? yes
    Result: merg (10^-3 erg) s^-1.0 (Second)^-1.0
    % java -jar unity.jar -icds -ofits -v -g merg/s
    merg s-1
    Checking , in input syntax cds:
    check: all units recognised? no
    ...with guessing? no
    check: all units recommended? no
    check: all constraints satisfied? yes
    Result: merg (10^-3 Erg), guessed s^-1.0 (Second)^-1.0

In the latter cases, the `-v` option _validates_ the input string against various constraints. The expression `mm/s` is completely valid in all the syntaxes. In the FITS syntax, the erg is a recognised unit, but it is deprecated; although it is recognised, it is not permitted to have SI prefixes. In the CDS syntax, the erg is neither recognised nor (a fortiori) recommended; since there are no constraints on it in this syntax, it satisfies all of them (this latter behaviour is admittedly slightly counterintuitive).

In the final case, we ask the library to ‘guess’ what an `erg` is, since it is not recognised in the CDS syntax.
It correctly guesses an Erg, but although this parsed properly as an Erg (as opposed to the unknown unit `erg`), it is still not a recognised unit. The guessing process can do a little more, and for example can recognise `degrees` as Degrees (as opposed to deci-`egrees`), and interpret improperly pluralised `ergs` as the Erg as well.

The library of ‘known units’ draws on v1.1 of the excellent [QUDT](http://qudt.org) units ontology. See `src/qudt` in the repository.

Portability
-----------

The library builds on OS X (tested on 10.6 to 10.10), on Scientific Linux, on Ubuntu, and on OpenBSD (with all checks on). I don't systematically test on all these platforms, however. I have as yet made no serious attempt to port the library more broadly, but I don't anticipate problems. Reports of success or failure, and fixes, are all welcome.

The Java implementation is source-compatible with Java 1.5, and unity.jar is built to be compatible with a 1.5 JRE.

Building
--------

The usual:

    % ./configure
    % make
    % make check
    % make install

The build process requires GNU make (as opposed to BSD make).

Pre-requirements: distribution tarball
--------------------------------------

**No library dependencies.** To build from a distribution, the only pre-requirements are a C compiler and a JDK (1.5 or later). You can build either or both of the C and Java libraries, at your option (eg `cd src/c; make check`).

If the JUnit jar is in the CLASSPATH, then `make check` will run more tests than if it's absent.

Pre-requirements: repository checkout
-------------------------------------

See the file [README-developer.md](README-developer.md) for instructions on building from a repository checkout. These instructions are unfortunately a little intricate, and shouldn't be necessary for most users.

Limitations
-----------

  * Currently ignores some of the odder unit restrictions (such as the OGIP requirement that 'Crab' can have a 'milli' prefix, but no other SI prefixes).

Hacking
-------

The distributed source set is assembled using quite a lot of preprocessing, involving parser- and documentation-generators. It's not intended to be a useful starting point for hacking on the software. For that, see the [instructions on building from a repository checkout](README-developer.md).

[vounits]: https://www.ivoa.net/documents/VOUnits/
[cds]: https://vizier.u-strasbg.fr/vizier/doc/catstd-3.2.htx
[fits]: https://doi.org/10.1051/0004-6361/201015362
[fitsspec]: https://fits.gsfc.nasa.gov/fits_standard.html
[fitswcsiv]: https://doi.org/10.1051/0004-6361/201424653
[ogip]: https://heasarc.gsfc.nasa.gov/docs/heasarc/ofwg/docs/general/ogip_93_001/
[dist]: https://purl.org/nxg/dist/unity

[Norman Gray](https://nxg.me.uk)
2022 August 2