README: WAIS Unix release 8 b5 Sun May 10 1992 Brewster Kahle Thinking Machines Corp Brewster@think.com ------------------------------------------------------------------------ WARRANTY DISCLAIMER This software was created by Thinking Machines Corporation and is distributed free of charge. It is placed in the public domain and permission is granted to anyone to use, duplicate, modify and redistribute it provided that this notice is attached. Thinking Machines Corporation provides absolutely NO WARRANTY OF ANY KIND with respect to this software. The entire risk as to the quality and performance of this software is with the user. IN NO EVENT WILL THINKING MACHINES CORPORATION BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT THE USE OF THIS SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES. This directory contains software for creating a sample server, protocol software, and a low level user interface building tool. See the readme file in the next directory up for details on how use this software. This file will describe the overall structure of this directory. To index files, a set of files called ir*.[ch] are used. The file ircfiles.c allows for customization to new file formats and may want to be added to by sophisticated users. irbuild.c contains the main program for this. To create a server, the file server.c is used. To make a server that does not use the inverted files (because you have your own database of some sort), change the ir.c file. This is not as easy to do as we would like it to be. The file ui.c is a low level tool for creating user interfaces. The directory ../ui is built on top of that file. The protocol software is broken into z39.50 files (z*) and wais protocol extension files (w*). These files will be changed dramatically as the standards committee makes changes. We will attempt to keep the ui.c interface the same. The rough flow of a question from user to server is shown below. These are not strictly levels of calling, rather, many of these files are tools that are called from the higher levels. ../ui/shell-ui.c /* a user interface, could be ../x, any other */ ui.c /* user interface toolkit */ wprot.c + wutil.c /* wais protocol */ zprot.c + zutil.c /* Z39.50 protocol */ transport.c /* ready the message for transport */ sockets.c /* use sockets as a transport */ transport.c zprot.c + zutil.c wprot.c + wutil.c ir.c /* server toolkit file */ irsearch.c /* searching the inverted file indices */ If you want to make a new type of server look in ir.c; if you want to make a new user interface, look in ui.c. The Indexer This is a very simple inverted file system with primitive word weighting. What is created is an inverted index of all the words over 2 characters long in the input text. Words are truncated to 20 characters. Each word has weight associated with it; the first time it occurs in a document it gets an extra weight of 5, and from then on, it gets a weight of 1 in that document. Words in the headline are worth an extra 10. Searching is done by finding the documents with the most word weights for each word in the query. Relevant document's words are worth 10 times less. For indexing files the structure is: irbuild.c /* top level main */ irtfiles.c /* parses the file */ (using ircfiles.c) /* rules for parsing files into documents */ irfiles.c /* file structures except inverted file */ irext.h /* defines interface with particular backends */ irhash.c /* add_word definition for inverted file */ irinv.c /* inverted file definition */ **** NOTE **** There are a few compile time symbols used in the building of the database files that are of particular interest to database maintainers. These symbols control the size of certain fields within the index files, and hence the total number of index objects that can be stored within these files. They must be changed to increase the number of records a database may contain. They are: DOC_TAB_ENTRY_FILENAME_ID_SIZE DOC_TAB_ENTRY_HEADLINE_ID_SIZE Each of these are set to 3 in irfiles.c, which limits the size of the filename and headline tables to 16M. Setting these to 4 allows these tables to grow to 4G (which is a limit to most file systems), but you should be aware that this will produce indexes that cannot be read by standard servers. You must also increase the the DOC_TAB_ELEMENT_SIZE (which is the sum of the size of the fields in the document table). If you get an error claiming a particular number cannot be represented in 3 bytes, this is probably the reason. These errors usually occur when indexing data types that are almost all headline (like archie and the one_line type), or when the total length of the headlines is greater than 16 megabytes (as with very large databases).
These are the contents of the former NiCE NeXT User Group NeXTSTEP/OpenStep software archive, currently hosted by Netfuture.ch.