Lithography Word Recount

I was befuddled (rank: 53,829) by my recent experience with wordcount.org (see my previous post). It seems that the word ‘lithography’ is ranked appallingly low in frequency of use, relegating me and my life’s work to the ranks of the perennially unpopular. But something smelled funny. I began to think that WordCount was not very good at counting. Since I have spent a lot of time thinking about how to measure things over the years, I decided to do what I always do when I see a data point I don’t like: blame the measurement.

I began by looking into the website’s counting method. From the wordcount.org site:

“WordCount™ is an artistic experiment in the way we use language. It presents the 86,800 most frequently used English words, ranked in order of commonness… WordCount data currently comes from the British National Corpus, a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent an accurate cross-section of current English usage.”

So WordCount is an art project. I suppose that doesn’t mean it couldn’t be accurate, though I suspect that accuracy is low on the list of success criteria for most artists. But what is the British National Corpus? I found the official BNC website, and this is what they said:

“The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written.”

British English! That explains a lot. I thought the word count would relate to real English. But since lithography was a European invention, and was certainly practiced in England, I’m not sure that this could explain lithography’s unexpected lack of popularity. True, England doesn’t have a semiconductor industry to speak of, so talk of semiconductor lithography over dinner is probably unlikely. But still, the frequency of use seemed too low, especially compared to ‘sciorto’.

I did a little more digging. Of the 100 million words in the collection, the word ‘lithography’ is used 47 times. That’s a pretty small count, even if the sample appears to be large. 100 million words is obviously not enough if you want good statistics at the tail of the distribution. The other words near ‘lithography’ on the list – luqa, calculi, tiverton, kaysone, sciorto, and bullingdon – were all tied with lithography. Digging further in the BNC website, I could even find the sources for those 47 word uses. This is where the fun begins.

Yes, Sciorto is an Italian family name, but Count Roman di Sciorto is a character from a romance novel called Calypso’s Island, the source of all 47 occurrences in the BNC. Talk about skewing the sample. Here is one example: “How ludicrous, after all, to have imagined that the great Count Romano de Sciorto, of Casa Sciorto, of the Città Notabile, the Noble City, could fall seriously in love with her.” Riveting. Tiverton, while certainly a city in England, is also a character from another romance novel, Hidden Flame, from which 19 of its 47 word-use references came. It seems that romance novels make up a fair part of the 100 million word collection. Almost every use of Bullingdon occurred on television news and referred to the prison of that name in Oxfordshire, England. What we have here is a phenomenon called ‘the sampling sucks’, caused by the lumpiness of an abysmally low sample size for these words. 100 million words seems large, but when you think about all of the words that are written and spoken in English each day, that number starts looking very small.
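
For the nerds, here is a rough sketch of the arithmetic behind that complaint. It is an illustration only: the function names are made up, the relative-error estimate assumes each occurrence is an independent event (a Poisson-style counting assumption, which the Sciorto example plainly violates), and the "effective sources" measure is one convenient way to express lumpiness (the inverse Simpson index), not anything the BNC or WordCount reports. The only numbers taken from the text are the 100 million word corpus size and the 47 occurrences.

```python
import math

CORPUS_SIZE = 100_000_000  # words in the BNC, as quoted above
COUNT = 47                 # occurrences of 'lithography' (and 'sciorto')


def per_million(count, corpus_size=CORPUS_SIZE):
    """Occurrences per million words of the corpus."""
    return count / corpus_size * 1_000_000


def poisson_relative_error(count):
    """Relative counting uncertainty, ASSUMING independent occurrences.

    For Poisson-distributed counts the standard deviation is sqrt(count),
    so the relative uncertainty is 1 / sqrt(count).
    """
    return 1 / math.sqrt(count)


def effective_sources(counts_by_source):
    """Inverse Simpson index: how many equally weighted sources the
    occurrences behave like. 1.0 means everything came from one text."""
    total = sum(counts_by_source)
    return total ** 2 / sum(c * c for c in counts_by_source)


if __name__ == "__main__":
    print(f"frequency: {per_million(COUNT):.2f} per million words")
    print(f"best-case relative error: {poisson_relative_error(COUNT):.1%}")
    # 'sciorto': all 47 occurrences come from a single novel.
    print(f"sciorto, effective sources: {effective_sources([COUNT]):.1f}")
    # Hypothetical contrast: 47 occurrences spread over 47 different texts.
    print(f"evenly spread word: {effective_sources([1] * COUNT):.1f}")
```

Even in the best case, 47 counts carry an uncertainty of roughly 15%, and a word whose occurrences all come from one novel is effectively a sample of one. With many words tied or nearly tied at this count, that is plenty of noise to reshuffle the rankings.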

The bottom line is this: WordCount is art, and while it definitely has words, it doesn’t do a very good job of counting. You shouldn’t expect artists to count – that’s what nerds are for.

By the way, ‘recount’ is number 29,409 on the list. I think the wordcount.org folks need to move it a little higher up.

2 thoughts on “Lithography Word Recount”

  1. No worries, Chris.
    It’s small but it’s growing fast…
    Lithography ranks 42832 today.
    You say, it was 53829 one month ago?
    At this rate, lithography may be among the 1000 most common words by YE’08.
    Or was that the effect of Micro ‘2008 hitting the press?
    Alexander

  2. Sorry Alex, but it was the word "befuddled" that ranked 53,829. Lithography’s ranking has not changed. At least "lithography" is more popular than "befuddled". I think that the word count statistics are rarely if ever updated.
