ftp.nice.ch/pub/next/connectivity/infosystems/WAIStation.1.9.6.N.b.tar.gz#/WAIS/ir

README: WAIS Unix release 8 b5				Sun May 10 1992
Brewster Kahle					 Thinking Machines Corp
Brewster@think.com 
------------------------------------------------------------------------

WARRANTY DISCLAIMER

This software was created by Thinking Machines Corporation and is distributed 
free of charge.  It is placed in the public domain and permission is granted 
to anyone to use, duplicate, modify and redistribute it provided that this 
notice is attached.

Thinking Machines Corporation provides absolutely NO WARRANTY OF ANY KIND with
respect to this software.  The entire risk as to the quality and performance 
of this software is with the user.  IN NO EVENT WILL THINKING MACHINES 
CORPORATION BE LIABLE TO ANYONE FOR ANY DAMAGES ARISING OUT THE USE OF THIS
SOFTWARE, INCLUDING, WITHOUT LIMITATION, DAMAGES RESULTING FROM LOST DATA OR
LOST PROFITS, OR FOR ANY SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES.


This directory contains software for creating a sample server, protocol
software, and a low level user interface building tool.  See the readme
file in the next directory up for details on how use this software.  This
file will describe the overall structure of this directory.

To index files, a set of files called ir*.[ch] are used.  The file
ircfiles.c allows for customization to new file formats and may want to be
added to by sophisticated users.  irbuild.c contains the main program for
this.

To create a server, the file server.c is used.

To make a server that does not use the inverted files (because you have
your own database of some sort), change the ir.c file.  This is not as easy
to do as we would like it to be.

The file ui.c is a low level tool for creating user interfaces.  The
directory ../ui is built on top of that file.

The protocol software is broken into z39.50 files (z*) and wais protocol
extension files (w*).  These files will be changed dramatically as the
standards committee makes changes.  We will attempt to keep the ui.c
interface the same.

The rough flow of a question from user to server is shown below.  These are
not strictly levels of calling, rather, many of these files are tools that
are called from the higher levels.

../ui/shell-ui.c	/* a user interface, could be ../x, any other */
    ui.c		/* user interface toolkit */
   wprot.c + wutil.c 	/* wais protocol */
  zprot.c + zutil.c	/* Z39.50 protocol */
 transport.c		/* ready the message for transport */
sockets.c		/* use sockets as a transport */
 transport.c		
  zprot.c + zutil.c	
   wprot.c + wutil.c 	
    ir.c		/* server toolkit file */
     irsearch.c		/* searching the inverted file indices */

If you want to make a new type of server look in ir.c;
if you want to make a new user interface, look in ui.c.


The Indexer

This is a very simple inverted file system with primitive word weighting.

What is created is an inverted index of all the words over 2 characters
long in the input text.  Words are truncated to 20 characters.  Each word
has weight associated with it; the first time it occurs in a document it
gets an extra weight of 5, and from then on, it gets a weight of 1 in that
document.  Words in the headline are worth an extra 10.
  Searching is done by finding the documents with the most word weights for
each word in the query.  Relevant document's words are worth 10 times less.

For indexing files the structure is:
irbuild.c		/* top level main */
 irtfiles.c 		/* parses the file */ 
  	(using ircfiles.c) /* rules for parsing files into documents */
  irfiles.c		/* file structures except inverted file */
   irext.h 		/* defines interface with particular backends */
    irhash.c		/* add_word definition for inverted file */
     irinv.c		/* inverted file definition */

**** NOTE ****

There are a few compile time symbols used in the building of the database
files that are of particular interest to database maintainers.  These
symbols control the size of certain fields within the index files, and
hence the total number of index objects that can be stored within these
files.  They must be changed to increase the number of records a database
may contain.

They are:

 DOC_TAB_ENTRY_FILENAME_ID_SIZE

 DOC_TAB_ENTRY_HEADLINE_ID_SIZE

Each of these are set to 3 in irfiles.c, which limits the size of the
filename and headline tables to 16M.  Setting these to 4 allows these
tables to grow to 4G (which is a limit to most file systems), but you
should be aware that this will produce indexes that cannot be read by
standard servers.  You must also increase the the DOC_TAB_ELEMENT_SIZE
(which is the sum of the size of the fields in the document table).

If you get an error claiming a particular number cannot be represented in 3
bytes, this is probably the reason.  These errors usually occur when
indexing data types that are almost all headline (like archie and the
one_line type), or when the total length of the headlines is greater than
16 megabytes (as with very large databases).

These are the contents of the former NiCE NeXT User Group NeXTSTEP/OpenStep software archive, currently hosted by Netfuture.ch.