A DESCRIPTION OF A COMPUTER-USABLE DICTIONARY FILE BASED ON THE OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH

Roger Mitton, Department of Computer Science, Birkbeck College, University of London, Malet Street, London WC1E 7HX

June 1992 (supersedes the versions of March and Nov 1986)

In 1985-86 I produced a dictionary file called CUVOALD (Computer Usable Version of the Oxford Advanced Learner's Dictionary). This was a partial dictionary of English in computer-usable form - "partial" because each entry contained only some of the information from the original dictionary, and "computer-usable" (rather than merely "computer-readable") because it was in a form that made it easy for programs to access it.

A second file, called CUV2, was produced at the same time. This was derived from CUVOALD and was the same except that it also contained all inflected forms explicitly, eg it contained "added", "adding" and "adds" as well as "add".

I have now added some information to each entry and some more entries to CUV2, to produce a new version of CUV2. This document describes this new file.

These files were derived originally from the Oxford Advanced Learner's Dictionary of Current English [1], third edition, published by the Oxford University Press, 1974, the machine-readable version of which is available to researchers from the Oxford Text Archive. The task of deriving them from the machine-readable OALDCE was carried out as part of a research project, funded by the Leverhulme Trust, into spelling correction. The more recent additions have been carried out as part of my research as a lecturer in Computer Science at Birkbeck College.

THE FILE FORMAT

CUV2 contains 70646 entries. Each entry occupies one line. Samples are given at the end of this document.
The longest spelling is 23 characters; the longest pronunciation is also 23; the longest syntactic-tag field is also (coincidentally) 23; the number of syllables is just one character ('1' to '9'), and the longest verb-pattern field is 58. The fields are padded with spaces to the lengths of the longest, ie 23, 23, 23, 1 and 58, making the record length 128. The spelling begins at position 1, the pronunciation at position 24, the syntactic-tag field at position 47, the number of syllables is character 70, and the verb-pattern field begins at position 71.

The file is sorted in ASCII sequence; this means, of course, that the entries are not in the same order as in the OALDCE.

WHAT THE DICTIONARY CONTAINS

Each entry consists of a spelling, a pronunciation, one or more syntactic tags (parts-of-speech) with rarity flags, a syllable count, and a set of verb patterns for verbs.

The first file derived from the OALDCE (CUVOALD) contained all the headwords and subentries from the original dictionary - subentries are words like "abandonment" which comes under the headword "abandon" - except for a handful that contained funny characters (such as "Lsd" where the "L" was a pound sign). Subentries were not included if they consisted of two or three separate words that occurred individually elsewhere in the dictionary, such as "division bell" which comes under the headword "division", except when the combination formed a syntactic unit not immediately predictable from its constituents, eg "above board", which is listed as an adverb.

To this list of about 35,000 entries, I added about 2,500 proper names - common forenames, British towns with a population of over 5,000, countries, nationalities, states, counties and major cities of the world. I would like to have added many more proper names, but I didn't have the time.

The second version of the file (CUV2) contained all these entries plus inflected forms making a total of about 68,000 entries.
Since 1986 I have made a number of corrections, added the rarity flags and the syllable counts and inserted about 2,000 new entries. The new entries, nearly all of which were derived forms of words already in the dictionary, were selected from a list of several thousand words that occurred in the LOB Corpus [3] but were not in CUV2. I also made changes to existing entries where these were implied by the new entries; for example, when adding a plural form of a word whose existing tag was "uncountable", it was necessary to change the tag of the singular form. I also added about 300 reasonably common abbreviations (see note below).

A number of words (ie spellings) have more than one entry in the OALDCE, eg "water 1" (noun) and "water 2" (verb). In CUV2, each word has only one entry unless it has two different pronunciations, eg "abuse" (noun and verb). I have departed from this rule in the case of compound adjectives, such as "hard-working", which have a slightly different stress pattern depending on whether they are used attributively ("she's a hard-working girl") or predicatively ("she's very hard-working"). These are entered only once; they generally have the attributive stress pattern except when the predicative one seemed the more natural. (See also the note below on abbreviations.) I have also given only one entry to those words that have strong and weak forms of pronunciation, such as "am" (which can be pronounced &m, @m or m). Generally it is the strong form that is entered.

As regards the coverage of the dictionary, readers might be interested in a paper by Geoffrey Sampson [4] in which he analyses a set of words from a sample of the LOB Corpus [3] that were not in CUV2. The recent additions should have gone some way to plugging the gaps that his study identified.
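
Readers who want to check points such as these against the file itself may find the following useful. It is a minimal sketch in Python, not part of the original distribution, which slices a record into its five fields using the column positions given under THE FILE FORMAT and then groups entries by spelling, so that spellings with more than one entry (such as "abuse") can be listed. The names Entry, parse_record, make_record and group_by_spelling are illustrative only, and the sample records carry placeholder field values, not real data from CUV2.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple

RECORD_LENGTH = 128  # 23 + 23 + 23 + 1 + 58, as stated under THE FILE FORMAT


class Entry(NamedTuple):
    spelling: str
    pronunciation: str
    tags: str
    syllables: str  # kept as the single character found in the file
    verb_patterns: str


def parse_record(line: str) -> Entry:
    """Slice one fixed-width record into its five fields.

    The 1-based positions in the description (1, 24, 47, 70, 71)
    become the 0-based slice boundaries 0, 23, 46, 69, 70.
    """
    # Re-pad in case trailing spaces were stripped from the line.
    line = line.rstrip("\r\n").ljust(RECORD_LENGTH)
    return Entry(
        spelling=line[0:23].rstrip(),
        pronunciation=line[23:46].rstrip(),
        tags=line[46:69].rstrip(),
        syllables=line[69:70],
        verb_patterns=line[70:128].rstrip(),
    )


def make_record(spelling, pronunciation, tags, syllables, verb_patterns=""):
    """Build a 128-character record (hypothetical helper, for testing only)."""
    return (
        spelling.ljust(23)
        + pronunciation.ljust(23)
        + tags.ljust(23)
        + syllables
        + verb_patterns.ljust(58)
    )


def group_by_spelling(lines: Iterable[str]) -> dict:
    """Map each spelling to the list of entries that share it."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():  # skip blank lines
            entry = parse_record(line)
            groups[entry.spelling].append(entry)
    return dict(groups)


if __name__ == "__main__":
    # Placeholder values only; these are NOT real CUV2 field contents.
    sample = [
        make_record("abuse", "PRON-1", "TAG-1", "2"),
        make_record("abuse", "PRON-2", "TAG-2", "2", "VP-1"),
        make_record("add", "PRON-3", "TAG-3", "1", "VP-2"),
    ]
    for spelling, entries in group_by_spelling(sample).items():
        if len(entries) > 1:
            print(spelling, len(entries))
```

To run this over the real file, group_by_spelling can be given an open file object in place of the sample list; the file name and character encoding to use depend on how the file was obtained and are not specified here.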