Can we just clear this up now: the word ‘data’, in english, is a singular mass noun. It is thus a grammatical and stylistic error to use it as a plural.

Plural use is barbaric: amongst other crimes, it is a deliberate archaism, and thus a symptom of bad writing.

First, some history

This section is here because the quasi-historical argument ‘data is a plural latin word’ is trotted out far too often, as justification for treating ‘data’ as a plural noun in english. This section is here to acknowledge that this statement is true, but observe that it is irrelevant, since etymology informs, but cannot control, how we use words in our living language.

Why don't you go right on to Usage, which has a much more important argument?

The latin word data (pronounced ‘dah-tah’) is the neuter plural past participle of the first conjugation verb dare, ‘to give’ (it's actually also the feminine singular past participle, but that really, really, doesn't matter). The english word ‘data’ (pronounced ‘dah-tah’ or ‘day-tah’) is a noun referring variously to measurements, observations, images, and the other raw materials of scientific enquiry. In this sense it originated in the physical sciences and computing in the mid twentieth century, and is busily and cheerfully spreading into other areas. The two words are, not surprisingly, etymologically related.

As a past participle, latin data is precisely analogous to the english word ‘given’, as in ‘I have given’. In latin as in english, past participles can be used as adjectives – ‘...accomplishing a given end’ – and even as nouns – ‘“The English language”...from the point of view of any individual a “given”, it is not something he creates for himself’; using adjectives as nouns is more common in latin than in english (these examples, like most of the others in this note, along with essentially all of the history, come from the OED).

As a noun, latin datum is in the second declension, and neuter.

So much for latin

Latin data appears to have made its way into english in the mid 17th century (according to the OED, again), with english ‘data’ making an appearance in the 1646 sentence ‘From all this heap of data it would not follow that it was necessary’, illustrating the OED's sense 1a: ‘A thing given or granted; something known or assumed as fact, and made the basis of reasoning or calculation; an assumption or premiss from which inferences are drawn.’ (note that this very first appearance of the word in english refers to a quantity of data, a ‘heap’, rather than a number). Datum – the neuter singular past participle – makes its first appearance almost a hundred years later, in the same sense. Sense 2, for the OED, is ‘Facts, esp. numerical facts, collected together for reference or information’, first sighted in 1899. ‘Data’ as a computing term is first spotted in 1946, in the Annals of the Computation Laboratory, Harvard University: ‘Two card feeds for supplying the machine with empirical or other data’. Interestingly, this is listed as sense 1d, meaning that the OED editors think this usage is closer in meaning to the ‘thing given or granted’ of sense 1a, than to the more scientific sense of sense 2: I'm not sure I would agree.

That 1646 quotation doesn't suggest that the word was a particularly novel one, so that it will likely have had some prehistory amongst english-speakers. But that prehistory was as likely in latin as in english: it was in the seventeenth century that intellectuals across Europe were remaking science, and doing it, radically, in english, french and other vernaculars, rather than latin. As they remade science, they remade, reworked, and simply invented their technical vocabularies, against the background of the substantial technical vocabularies of the late latin in which they had been educated, and this is when words like english ‘data’ appear.

What that means is that there was almost certainly no latin word for the concept that we now identify by the english word ‘data’, and that to the extent that data was a technical term in late latin, and hence in early modern english, it surely meant nothing, or nothing much, beyond the slightly specialised mathematical meaning of the english word ‘given’ (as in ‘Given ten goats...’ as the preamble to a rather tedious algebra problem). If you know otherwise, I'd be delighted to hear.

Put another way, that means that the word ‘data’, as a technical term referring to the ore of observations, which can be painstakingly reduced to extract knowledge, is not a latin word at all. It's a native english word with a latin past, which means, bluntly, that we get to choose how to use it, and if its meaning changes over time – as it has – then its grammatical analysis can reasonably and properly migrate also.

Put yet another way, this whole section is redundant. Etymology can satisfy our curiosity about our language, or give colour, texture and ripeness to the words we use, but it can do no more than suggest how we actually use the words, when we do use them, ourselves, now.

English includes many words originally press-ganged from latin, which have changed their grammatical type.

‘Stamina’ and ‘agenda’ are two well-known ones, and ‘media’ is apparently becoming one. Separately from its botanical sense, a ‘stamen’ was the warp of a fabric, or figuratively some essential element of a thing; the word ‘stamina’ now refers to a completely different concept, which has no need, and no room, for a singular form – it makes no sense to speak of one of the things of which stamina is the plural. We can even watch the word changing its grammatical ‘number’, from plural to singular. In the list of OED usages for this sense (3a) of ‘stamina’ – from ‘her stamina could not last much longer’ (the earliest, in 1726) to ‘his stamina is gone entirely’ (1834) – we see a century or so where the word is used in contexts where its number is ambiguous; and in the first case in this list where its number is clear, it is clearly singular, and the word is being used in its modern sense. The OED's last spotting of the relevant sense of ‘stamen’ used as a singular is in 1794.

‘Agendum’ isn't quite obsolete yet, but if the chairman of a meeting talks of ‘moving on to the next agendum’, she is being deliberately and unattractively pompous, or being facetious, and if she asks ‘how many agenda are still to be discussed?’, she should be thrown bodily from the room. In any case, ‘agenda’ no longer means just a collection of ‘agendums’: it now refers to the list of agendums as a separate notion, and that meaning has expanded (unattractively and unnecessarily in my opinion) to refer to secret and probably malevolent aspirations, as in ‘I don't trust X – she has a personal agenda’.

Using ‘agenda’ or ‘stamina’ as a plural is now a grammatical error in english. This isn't being prescriptive, but descriptive, in the sense that native english speakers do not naturally use these words as plurals, and detect them as a mistake when they hear or read them thus used.

‘Media’ is edging the same way, in the sense that the word has accumulated at least one meaning beyond the simple plural of ‘medium’. This is arguably a good thing, though I wouldn't go to the stake for it, as it makes a nice distinction possible. If you're told ‘the media are outside’, you know that radio and TV have turned up – how nice: you should tell your mother to watch the evening news. If, on the other hand, you're told ‘the media is outside’, you brace yourself to bodycharge a mob of slavering hacks asking if you've stopped beating your wife, photographing what cleavage they can, and demanding that Something Must Be Done about the lurid crimes they so lovingly describe.

Using ‘phenomena’, ‘criteria’ or ‘strata’ as singulars is a grammatical no-no – these are simply irregular plurals of still-useful and still-current singulars.

In this spectrum (not ‘spectra’, of course), ‘data’ is clearly located near ‘agenda’. The difference is that, though I know what an ‘agendum’ is (and have very certain opinions about folk who use the word), I really have very little clue what a ‘datum’ is – it's certainly not one of the things that makes up data.

The strongest argument against plural ‘data’ is that plural ‘data’ looks weird; thus it is distracting; thus it is bad writing.

When you read in the middle of a sentence ‘...the data are analysed by...’, you stumble: your subconscious grammatical consistency checks raise an alarm! – you have misparsed them (yes, like that). You automatically go back to the beginning for another go, more carefully this time, but realise, too late, that you are simply reading the work of an author in his weddings-and-funerals suit, writing as he would never speak. You regather your concentration, and press on.

Because almost no-one does speak like that. This is not an argument about formal versus informal use, but about the distinction between usage and prescription. The majority of writers who would dutifully pluralise ‘data’ in writing naturally and consistently use it as a mass noun in conversation: they ask how much data an instrument produces, not how many; they talk of how data is archived, not how they are archived; they talk of less data rather than fewer; and they always talk of data with units, saying they have a megabyte of data, or 10 CDs, or three nights, and never saying ‘I have 1000 data’ and expecting to be understood.

If challenged, they will respond (with a slightly nervous smugness) that ‘data is a latin plural’. Agree to this, for the sake of professional harmony, and carry on the conversation, making sure to mention that ‘the telescope has data many odd images tonight’ (it's a past participle, remember), suggest looking at the data raw images (...or an adjective) and that you both examine the datorum variance (surely they recall the genitive plural); suggest they give you the datis (...the dative), so that you can redo the analysis with their datis (...and the ablative). If they object to all this nonsense – as well they might – ask them to explain their sentimental attachment to the nominative plural, that they would use that in all cases, in brute defiance of good latin grammar (screech ‘never did me any harm!’ – twitch and boggle at this point for maximum effect, and on a good day, you'll have them run screaming from the room).

Isn't it lucky english is now genderless, making ‘data’ neuter, else we'd have to memorise masculine dati (dati dati datos datorum datis datis) and feminine datae (whatever...), too? Isn't it simpler just to speak english?

As we saw above, the word ‘stamina’ was used ambiguously for a century or so, until the word of which it was the plural – ‘stamen’ – lost its relevant meaning. Exactly analogously, the OED's quotations for ‘data’ in the computer sense are ambiguous as to number in 1946 (twice), 1958, 1960, and 1967, unambiguously singular in 1964 and 1970 (twice), and unambiguously plural in 1969. Its ‘numerical facts’ sense has seven quotations from 1899 to 1971, of which only two, in 1946 and 1958, are clearly plural. After 1807 there are precisely zero uses of ‘datum’ in the relevant senses 1a, 1d or 2 (sense 1b is the combinative sense, as in ‘datum-line’, sense 1c is a technical sense in philosophy, and sense 3 is the combinative sense of ‘database’, ‘data processing’, and the like).

That obviously doesn't mean that the word ‘datum’ hasn't been used in these senses after 1807, but it does rather suggest that the word is going the way of non-botanical ‘stamen’, being a word for an idea that is evaporating from our language.

Ask a scientist or engineer how many data she has (go on, try it). She'll tell you how many gigabytes she has, or how many datapoints, or how many observations, or how many photocopied articles. No no, you say, not how much, but how many ‘datums’? She will look at you, I guarantee, in A Funny Way. What on earth are you asking? No-one knows, because the word ‘datum’ has lost any useful meaning (for almost everyone – see below). On those occasions when you need to refer to some indivisible atom of data, you talk of bytes, or datapoints or observations as appropriate. But never of a datum.

The word ‘datum’ is still in use by surveyors, and other folk who need datum-lines, datum-marks, datum-planes and miscellaneous datum-combinations. In precise geodesy, for example, a ‘datum’ is the term for one of several models of the shape of the earth, relative to which the heights of mountains and the positions of telescopes are measured. This usage, which has nothing to do with our atom of data, has the perfectly regular plural ‘datums’, so that in texts which discuss these things, we can read sentences like ‘Frequently, users or creators of geospatial data are unaware or unsure of the projection or datum geospatial data are in’ – this is in ‘a quick, non-technical guide on the use of datums and projections’ (this 1999 publication of the USGS Center for Biological Informatics used to be here and here, but all I can now find is an indirect reference here). This is effectively the OED's sense 1b, but not in a combinative sense. This still carefully treats the second occurrence of ‘data’ as a plural, but can you imagine the confusion in this context if this word's putative singular were still live?

People who scrupulously write ‘data’ as a plural are frequently confused when it comes to more complicated sentences. There are plenty of examples such as (in a computational grids context) a reference to ‘quantities of data so large that it is no longer feasible to analyse these data at a single central site’, thus presenting an example of ‘data’ being used as both a mass-singular and a plural in the same sentence. Similarly, and more crashingly, I have read a serious document which asked ‘What is HEP data? The data themselves...’: it is impossible for these successive sentences to be both grammatically correct. Even a conventional phrase like ‘data preservation’ is suspect: it is unusual in english for nouns used as adjectives to be in the plural – you would not talk of ‘chairs preservation’.

The word of which ‘data’ is purportedly the plural has simply disappeared; this means two things. Firstly, passively, it creates a linguistic space into which ‘data’ can drop – there is no ambiguity in using ‘data’ in a singular sense. Secondly, and more importantly, if ‘datum’ has effectively disappeared, it tells us that ‘data’ cannot be simply its plural; unanchored, it has moved away from this simply derived meaning, to a distinct and independent meaning of its own. It has accordingly accreted usage rules of its own, unencumbered by any latin past.

‘Data’ no longer means just one (damn) datum after another. Twentieth-century ‘data’ refers to a mass of raw information, which we measure rather than count, and this is as true now as it was when the word made its 1646 debut. This universal perception of data as measured rather than counted puts the word firmly and unambiguously in the same grammatical category as ‘coal’, ‘wheat’ and ‘ore’, which is that of the mass, or aggregate, noun. As such, it is always and unavoidably grammatically singular. We would never ask ‘how many wheat do you have?’ or say that ‘the ore are in the train’ if we wished to be thought a competent speaker of english; in the same way, and to the same extent, we may not ask ‘how many data do you have?’ or say ‘the data are in the file’ without committing a grammatical error.

Now here I am obviously being at least a little prescriptive. But more descriptive than it might at first appear: native speakers naturally use ‘data’ as a singular noun until someone in authority tells them not to, whether that is a journal style guide or someone with the ‘latin plural’ schtick. Thus the plural usage is maintained in the language only artificially, as a status marker – it will soon die.

As far as dictionaries go, the OED stigmatises data ‘in pl. form with sing. construction’ as (delightfully) ‘catachrestic and erroneous’, in the teeth of all their evidence I've adduced above. They rather cheekily include in their quotations illustrating this wickedness a bald and authoritative 1965 statement that ‘Incidentally, by general usage data is now accepted as a singular collective noun’. Oddly, the OED marks ‘datum’ as a ‘not naturalized, alien’ noun, but doesn't so mark ‘data’ – I feel this rather proves the point than undermines it. Despite that, the OED's ‘draft additions 2004’ refer to ‘the automated gathering of data in a form in which it can be processed by a computer’ and ‘data warehouse ... a database in which data collected from several operational systems is integrated’: none of their definitions in this draft use data as a plural, and although this is all listed under ‘datum’, the last use of that word in the entry is in (philosophical) sense 1c, and the longest senses 1d, 2 and 3 don't mention it at all.

Oxford's is generally good (it has the Correct opinion of split infinitives), but still cleaves doughtily to the ‘data is latin’ cause.

Looking at online dictionaries, the American Heritage Book of English Usage (not online) is rather evasive, and ends with the extraordinary remark that ‘When plural, data has the unusual characteristic of not being capable of modification by cardinal numbers. You may have various data but you will never have five or ten data.’ This has the unusual characteristic of indicating squarely that it's not a plural at all.

Merriam-Webster Online has a nice note on usage, saying ‘Data leads a life of its own quite independent of datum, of which it was originally the plural’ (indeed), and ending with a resigned sigh, ‘The plural construction is more common in print, evidently because the house style of several publishers mandates it’.

There's a general consensus amongst those who care enough to post about it. I suppose I'm not surprised the topic has been much blogged: entries I found include pieces by John Quiggin, John August and Kevin Drum. Though I saw these after writing this piece, there's a fair overlap in arguments and examples.

This piece also draws in spirit, and in some examples, from the chapter on ‘Data’ in Philip Howard's excellently entertaining Weasel Words (Corgi, 1978, now apparently out of print).

There are few sources which argue unashamedly that ‘data’ is a plural, beyond the authors' ritual incantation that ‘data is a latin plural’, which they seem to feel is argument enough, and the world-weary suggestion that they hear, in singular ‘data’, the hordes battering at the gates.

Much-delayed update: The exception to that dearth is a brief discussion on Andy Lawrence's blog in 2008, which led to a very detailed response from Peter Coles. That posting is entertaining, and I agree with lots of it, but the argument ultimately boils down to Peter's assertion that a sentence “If I had fewer data I would not be able to obtain an astrometric solution” is a legitimate sentence in english. I think it's not, but that's an only apparently prescriptivist conclusion arising from a descriptivist argument, namely that I believe that such a sentence would not be spontaneously produced, or recognised as correct, by a native speaker of generally ‘correct’ english who had not heard of the argument about ‘data’ (I'm taking it that Peter constructed that sentence to illustrate the point, rather than field-collected it). Since such a person would probably be a very rare beast, the question might not be decidable through usage, which I maintain is the only truly legitimate way. That being said...

The data is in: it is massive, and it is singular.

I am indebted to Steve Draper for pointers, and Peter Coles for the best counter-arguments.

