A DESCRIPTION OF A COMPUTER-USABLE DICTIONARY FILE BASED ON
THE OXFORD ADVANCED LEARNER'S DICTIONARY OF CURRENT ENGLISH

Roger Mitton,
Department of Computer Science,
Birkbeck College,
University of London,
Malet Street,
London WC1E 7HX

June 1992  (supersedes the versions of March and November 1986)


     In 1985-86 I produced a dictionary file called CUVOALD  (Computer
Usable Version of the Oxford Advanced Learner's Dictionary).  This was
a partial dictionary of English in computer-usable  form  -  "partial"
because  each  entry  contained  only some of the information from the
original  dictionary,  and  "computer-usable"  (rather   than   merely
"computer-readable")  because  it  was in a form that made it easy for
programs to access it.  A second file, called CUV2,  was  produced  at
the  same time.  This was derived from CUVOALD and was the same except
that it also contained all inflected forms explicitly, eg it contained
"added",  "adding" and "adds" as well as "add".  I have now added some
information to each entry and some more entries to CUV2, to produce  a
new version of CUV2.  This document describes this new file.

     These files were derived  originally  from  the  Oxford  Advanced
Learner's  Dictionary of Current English [1], third edition, published
by the Oxford University Press, 1974, the machine-readable version  of
which  is  available to researchers from the Oxford Text Archive.  The
task of deriving them from the machine-readable OALDCE was carried out
as  part  of  a research project, funded by the Leverhulme Trust, into
spelling correction.  The more recent additions have been carried  out
as  part  of my research as a lecturer in Computer Science at Birkbeck
College.

THE FILE FORMAT

     CUV2 contains 70646  entries.   Each  entry  occupies  one  line.
Samples  are  given at the end of this document.  The longest spelling
is 23 characters; the longest pronunciation is also  23;  the  longest
syntactic-tag  field  is  also  (coincidentally)  23;  the  number  of
syllables is  just  one  character  ('1'  to  '9'),  and  the  longest
verb-pattern  field  is  58.  The fields are padded with spaces to the
lengths of the longest, ie 23, 23, 23, 1 and  58,  making  the  record
length  128.   The spelling begins at position 1, the pronunciation at
position 24, the syntactic-tag field at position  47,  the  number  of
syllables  is  character  70,  and  the  verb-pattern  field begins at
position 71.  The file is sorted in ASCII  sequence;  this  means,  of
course, that the entries are not in the same order as in the OALDCE.
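
     The fixed-width layout described above can be read by slicing
each line at the stated positions.  The following Python fragment is
only a sketch under that description, not part of the original
software: the names WIDTHS, Entry and parse_line are invented for
illustration, and it assumes that a line may have lost its trailing
padding (so short lines are padded back to 128 characters).  The
syllable count is kept as a one-character string rather than converted
to a number.

```python
from typing import NamedTuple

# Field widths as given in the text: spelling, pronunciation,
# syntactic tags, syllable count, verb patterns (23+23+23+1+58 = 128).
WIDTHS = (23, 23, 23, 1, 58)
RECORD_LENGTH = sum(WIDTHS)


class Entry(NamedTuple):
    """One dictionary record (hypothetical name, for illustration)."""
    spelling: str
    pronunciation: str
    tags: str
    syllables: str
    verb_patterns: str


def parse_line(line: str) -> Entry:
    """Split one fixed-width record into its five fields."""
    # Drop the line terminator, then restore any stripped padding
    # (assumption: some copies of the file may have lost it).
    line = line.rstrip("\r\n").ljust(RECORD_LENGTH)
    fields = []
    start = 0
    for width in WIDTHS:
        # Remove only the padding on the right, so that spaces inside
        # a field (eg "above board") are preserved.
        fields.append(line[start:start + width].rstrip())
        start += width
    return Entry(*fields)
```

     With these widths the slices start at offsets 0, 23, 46, 69 and
70, which correspond to positions 1, 24, 47, 70 and 71 when counting
from 1 as the text does.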

WHAT THE DICTIONARY CONTAINS

     Each entry consists of a spelling, a pronunciation, one  or  more
syntactic  tags (parts-of-speech) with rarity flags, a syllable count,
and a set of verb patterns for verbs.

     The first file derived from the OALDCE (CUVOALD) contained all
the headwords and subentries from the original dictionary (subentries
are words like "abandonment", which comes under the headword
"abandon"), except for a handful that contained special characters
(such as "Lsd", where the "L" was a pound sign).  Subentries were not
included if they
consisted  of  two  or three separate words that occurred individually
elsewhere in the dictionary, such as "division bell" which comes under
the   headword  "division",  except  when  the  combination  formed  a
syntactic unit not immediately predictable from its  constituents,  eg
"above  board",  which  is listed as an adverb.  To this list of about
35,000 entries, I added about 2,500 proper names -  common  forenames,
British   towns   with   a   population   of  over  5,000,  countries,
nationalities, states, counties and major  cities  of  the  world.   I
would like to have added many more proper names, but I didn't have the
time.

     The second version of the file (CUV2) contained all these entries
plus inflected forms, making a total of about 68,000 entries.  Since
1986 I have made a number of corrections, added the rarity  flags  and
the  syllable  counts  and  inserted about 2,000 new entries.  The new
entries, nearly all of which were derived forms of  words  already  in
the  dictionary,  were  selected from a list of several thousand words
that occurred in the LOB Corpus [3] but were not in CUV2.  I also made
changes  to  existing  entries  where  these  were  implied by the new
entries; for example, when adding  a  plural  form  of  a  word  whose
existing  tag was "uncountable", it was necessary to change the tag of
the  singular  form.   I  also  added  about  300  reasonably   common
abbreviations (see note below).

     A number of words (ie spellings) have more than one entry in  the
OALDCE,  eg "water 1" (noun) and "water 2" (verb).  In CUV2, each word
has only one entry unless it  has  two  different  pronunciations,  eg
"abuse"  (noun  and verb).  I have departed from this rule in the case
of compound adjectives, such as "hard-working", which have a  slightly
different   stress   pattern   depending  on  whether  they  are  used
attributively ("she's a hard-working girl") or  predicatively  ("she's
very hard-working").  These are entered only once; they generally have
the attributive stress pattern except when the predicative one  seemed
the  more natural.  (See also the note below on abbreviations.) I have
also given only one entry to those words that  have  strong  and  weak
forms  of  pronunciation, such as "am" (which can be pronounced &m, @m
or m).  Generally it is the strong form that is entered.
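
     Since the file is sorted in ASCII sequence and a spelling has
more than one entry only when its pronunciations differ, all the
entries for a given spelling should be adjacent, and a program can
find them by binary search.  The Python fragment below is a sketch of
that idea rather than a description of any existing software: the
names spelling_keys and find_entries are invented, and it assumes that
the records are held in memory in file order and contain only ASCII
characters, so that Python's string comparison agrees with the order
of the file.

```python
from bisect import bisect_left, bisect_right

SPELLING_WIDTH = 23  # the spelling occupies positions 1 to 23


def spelling_keys(lines):
    """Return the spelling of each record, with its padding removed."""
    return [line[:SPELLING_WIDTH].rstrip() for line in lines]


def find_entries(lines, keys, word):
    """Return every record whose spelling is exactly `word`.

    `lines` must be in file order and `keys` must be the result of
    spelling_keys(lines), computed once and reused for each lookup.
    """
    lo = bisect_left(keys, word)    # first record with this spelling
    hi = bisect_right(keys, word)   # one past the last such record
    return lines[lo:hi]
```

     A word such as "abuse" would be expected to yield two records,
one for each pronunciation, while most words would yield one; a
spelling that is not in the file yields an empty list.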

     As regards the coverage  of  the  dictionary,  readers  might  be
interested  in  a paper by Geoffrey Sampson [4] in which he analyses a
set of words from a sample of the LOB Corpus[3] that were not in CUV2.
The  recent  additions  should have gone some way to plugging the gaps
that his study identified.