grep-1.6

This README documents GNU e?grep version 1.6.  All bugs reported for
previous versions have been fixed.

See the file INSTALL for compilation and installation instructions.

Send bug reports to bug-gnu-utils@prep.ai.mit.edu.

GNU e?grep is provided "as is" with no warranty.  The exact terms
under which you may use and (re)distribute this program are detailed
in the GNU General Public License, in the file COPYING.

GNU e?grep is based on a fast lazy-state deterministic matcher (about
twice as fast as stock Unix egrep) hybridized with a Boyer-Moore-Gosper
search for a fixed string that eliminates impossible text from being
considered by the full regexp matcher without necessarily having to
look at every character.  The result is typically many times faster
than Unix grep or egrep.  (Regular expressions containing backreferencing
may run more slowly, however.)
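
As a rough illustration of the filtering idea described above, the
following C sketch searches each line for a fixed string that every
match must contain, and calls the full regexp matcher only when that
string is present.  It is not the code in this package: it uses
Horspool's simplified bad-character skip rather than the
Boyer-Moore-Gosper search, and the POSIX <regex.h> matcher rather than
the lazy-state deterministic matcher.  The names horspool_find and
line_matches, and the idea of passing the required string in by hand,
are assumptions made for the example.

```c
#include <limits.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Horspool search (a simplification, not Boyer-Moore-Gosper).
   Returns the offset of the first occurrence of pat (length m)
   in text (length n), or -1 if there is none. */
long
horspool_find(const char *text, size_t n, const char *pat, size_t m)
{
  size_t skip[UCHAR_MAX + 1];
  size_t i, pos;

  if (m == 0)
    return 0;
  if (m > n)
    return -1;

  /* A character absent from the pattern lets us skip m positions. */
  for (i = 0; i <= UCHAR_MAX; ++i)
    skip[i] = m;
  /* Otherwise skip far enough to line up its last occurrence
     (the final pattern character is deliberately excluded). */
  for (i = 0; i + 1 < m; ++i)
    skip[(unsigned char) pat[i]] = m - 1 - i;

  pos = 0;
  while (pos + m <= n)
    {
      if (memcmp(text + pos, pat, m) == 0)
        return (long) pos;
      /* Shift by the skip for the text character under the
         pattern's last position; many characters are never read. */
      pos += skip[(unsigned char) text[pos + m - 1]];
    }
  return -1;
}

/* Hypothetical driver: "must" is a fixed string that any match of
   "re" has to contain.  Returns 1 if the line matches, 0 if not. */
int
line_matches(const regex_t *re, const char *must, const char *line)
{
  if (horspool_find(line, strlen(line), must, strlen(must)) < 0)
    return 0;			/* filtered out; regexp matcher never runs */
  return regexec(re, line, 0, NULL, 0) == 0;
}
```

For the pattern as.*trian compiled with regcomp(), a caller might pass
"trian" as the required string; lines lacking it are then rejected
without running regexec() at all.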

GNU e?grep is brought to you by the efforts of several people:

	Mike Haertel wrote the deterministic regexp code and the bulk
	of the program.

	James A. Woods is responsible for the hybridized search strategy
	of using Boyer-Moore-Gosper fixed-string search as a filter
	before calling the general regexp matcher.

	Arthur David Olson contributed code that finds fixed strings for
	the aforementioned BMG search for a large class of regexps.

	Richard Stallman wrote the backtracking regexp matcher that is
	used for \<digit> backreferences, as well as the getopt that
	is provided for 4.2BSD sites.  The backtracking matcher was
	originally written for GNU Emacs.

	D. A. Gwyn wrote the C alloca emulation that is provided so
	System V machines can run this program.  (Alloca is used only
	by RMS' backtracking matcher, and then only rarely, so there
	is no loss if your machine doesn't have a "real" alloca.)

	Scott Anderson and Henry Spencer designed the regression tests
	used in the "regress" script.

	Paul Placeway wrote the manual page, based on this README.

If you are interested in improving this program, you may wish to try
any of the following:

1.  Replace the fast search loop with a faster search loop.
    There are several things that could be improved, the most notable
    of which would be to calculate a minimal delta2 to use.

2.  Make backreferencing \<digit> faster.  Right now, backreferencing is
    handled by calling the Emacs backtracking matcher to verify the partial
    match.  This is slow; if the DFA routines could handle backreferencing
    themselves, a speedup on the order of three to four times might occur
    in those cases where the backtracking matcher is called to verify nearly
    every line.  Also, some portability problems due to the inclusion of the
    Emacs matcher would be solved because it could then be eliminated.
    Note that expressions with backreferencing are not true regular
    expressions, and thus are not equivalent to any DFA.  So this is hard.

3.  Handle POSIX style regexps.  I'm not sure if this could be called an
    improvement; some of the things on regexps in the POSIX draft I have
    seen are pretty sickening.  But it would be useful in the interests of
    conforming to the standard.

4.  Replace the main driver program grep.c with the much cleaner main driver
    program used in GNU fgrep.
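
To make the backreferencing problem in item 2 concrete, here is a small
C sketch of a pattern that needs \<digit>.  It uses the POSIX <regex.h>
interface as a stand-in, not the dfa.c or regex.c routines of this
package, and the function name is_doubled_word is invented for the
example.  The pattern accepts a lowercase word followed by a space and
the same word again; since the word may be of any length, no finite
automaton can remember it, which is why a backtracking matcher is
needed.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if line is a lowercase word, one space, and the same
   word again (e.g. "abc abc"), 0 if not, -1 if regcomp() fails.
   The pattern is a POSIX basic regexp: \( \) groups, \1 refers back. */
int
is_doubled_word(const char *line)
{
  regex_t re;
  int rc;

  if (regcomp(&re, "^\\([a-z][a-z]*\\) \\1$", REG_NOSUB) != 0)
    return -1;
  rc = regexec(&re, line, 0, NULL, 0);
  regfree(&re);
  return rc == 0;
}
```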

README.cray

(Message inbox:135)
Date:    Mon, 17 Oct 88 16:53:33 PDT
To:      mike@wheaties.ai.mit.edu
cc:      darin%pioneer@eos.arc.nasa.gov, luzmoor@violet.berkeley.edu
From:    James A. Woods <jaw@eos.arc.nasa.gov>
Subject: README.cray for GNU e?grep

I just sent this out to comp.unix.cray:

-------------------------------------------------------------------
From: jaw@eos.UUCP (James A. Woods)
Newsgroups: comp.unix.cray
Subject: GNU e?grep on Cray machines
Message-ID: <1750@eos.UUCP>
Date: 17 Oct 88 23:47:29 GMT
Organization: NASA Ames Research Center, California
Lines: 66

# "What comes after silicon?  Oh, gallium arsenide, I'd guess.  And after 
   that, there's a thing called indium phosphide."
	-- Seymour Cray, Datamation interview, circa 1980

     Now that most Cray software development is done on Crays themselves, 
thanks to Unix, GNU e?grep should come in handy.  Of course, if you're
scanning GENBANK for the Human Genome Project at 10 MB/second (the raw
X/MP Unix I/O rate), you really do need the speed.

     Sample, from one of the Ames Cray 2 machines:

	stokes> time ./egrep astrian web2		# GNU egrep
	alabastrian
	Lancastrian
	Zoroastrian
	Zoroastrianism
	0.5980u 0.0772s 0:01 35%
	stokes> time /usr/bin/egrep astrian web2	# ATT egrep
	alabastrian
	Lancastrian
	Zoroastrian
	Zoroastrianism
	7.6765u 0.1373s 0:15 49%

(web2 is a 2.4 MB wordlist, standard on BSD Unix.)

     To bring up GNU E?GREP, ftp Mike Haertel's version 1.1 package from
'prep.ai.mit.edu' or 'ames.arc.nasa.gov'.  Mention -DUSG in the Makefile,
and specify 

	#define SIGN_EXTEND_CHAR(c) ((c)>(char)127?(c)-256:(c))

in regex.c. [Cray characters, like MIPS chars, are unsigned, but the
compiler won't allow ... #define SIGN_EXTEND_CHAR(c) ((signed char) (c))]
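
The arithmetic in that macro can also be written as a function, which
may make the intent easier to see: values 0 through 127 are returned
unchanged, and values 128 through 255 are mapped to -128 through -1.
This is only a sketch; the function name sign_extend_char and the
masking to the low 8 bits are assumptions of the example, not part of
regex.c.

```c
/* Sign-extend the low 8 bits of c without relying on plain char
   being a signed type. */
int
sign_extend_char(int c)
{
  c &= 0xff;			/* keep one 8-bit byte (assumption) */
  return c > 127 ? c - 256 : c;
}
```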

     However, at least on the Cray 2, there's a compiler bug involving the
increment operator in complex expressions, which requires the following
modification (also in regex.c):

change
        m->elems[m->nelem++].constraint |= s2->elems[j++].constraint;
to
        m->elems[m->nelem].constraint |= s2->elems[j].constraint;
        m->nelem++;
        j++;

Thanks go to Darin Okuyama of NASA ARC for providing this workaround.

-- James A. Woods (ames!jaw)
   NASA Ames Research Center

P.S.  
Though Crays are not at their best pushing bytes, the timing difference
is even more exaggerated with heavier regexp processing, to wit:

	time ./egrep -i 'as.*Trian' web2
	...
	0.7677u 0.0769s 0:01 44%
vs.
	time /usr/bin/egrep -i 'as.*Trian' web2
	...
	16.1327u 0.1379s 0:32 49%

which is a mite unfair given a known System 5 egrep -i gaffe.  You get
extra credit for vectorizing the inner loop of the Boyer/Moore/Gosper
code, though changing all chars to ints might help also.

README.sunos4

[ N.B. This bug strikes on a Sun 3 running SunOS 4 with the cc -O4 option
  as well as on the sparc.  -Mike ]

Date:    Fri, 24 Feb 89 15:36:40 -0600
To:      mike@wheaties.ai.mit.edu
From:    Dave Cohrs <dave@cs.wisc.edu>
Subject: bug + fix in gnu grep 1.2 (from prep.ai.mit.edu)

I tried installing the GNU grep 1.2 on a Sun4 running 4.0.1 and
"Spencer test #36" failed.  After some experimenting, I found and
fixed the bug.  Well, actually, the bug is in the C compiler, but
I managed a workaround.

Description:

The Sun4 4.0.1 C compiler with -O doesn't generate the correct code for
statements of the form
	if("string")
		x;
	else
		y;
To be exact, "y;" gets executed when "x;" should be.  This causes the
#define FETCH() to fail for test #36.

Fix:

In an #ifdef sparc in dfa.c, I made two versions of FETCH, FETCH0() and
the regular FETCH().  The former takes only one argument, the latter
expects its 2nd argument to contain a non-nil string.  This removes
the need to test the constant strings, and the compiler bug isn't
exercised.  I then changed the one instance of FETCH() with a nil
second argument to be FETCH0() instead.

dave cohrs

===================================================================
RCS file: RCS/dfa.c,v
retrieving revision 1.1
diff -c -r1.1 dfa.c
*** /tmp/,RCSt1a05930	Fri Feb 24 15:32:33 1989
--- dfa.c	Fri Feb 24 15:23:34 1989
***************
*** 285,293 ****
--- 285,315 ----
  				   is turned off). */
  
  /* Note that characters become unsigned here. */
+ #ifdef sparc
+ /*
+  * Sun4 4.0.1 C compiler can't compare constant strings correctly.
+  * e.g. if("test") { x; } else { y; }
+  * the compiler will not generate code to execute { x; }, but
+  * executes { y; } instead.
+  */
+ #define FETCH0(c)   		      \
+   {			   	      \
+     if (! lexleft)	   	      \
+       return _END;	   	      \
+     (c) = (unsigned char) *lexptr++;  \
+     --lexleft;		   	      \
+   }
  #define FETCH(c, eoferr)   	      \
    {			   	      \
      if (! lexleft)	   	      \
+       regerror(eoferr);  	      \
+     (c) = (unsigned char) *lexptr++;  \
+     --lexleft;		   	      \
+   }
+ #else
+ #define FETCH(c, eoferr)   	      \
+   {			   	      \
+     if (! lexleft)	   	      \
        if (eoferr)	   	      \
  	regerror(eoferr);  	      \
        else		   	      \
***************
*** 295,300 ****
--- 317,323 ----
      (c) = (unsigned char) *lexptr++;  \
      --lexleft;		   	      \
    }
+ #endif sparc
  
  static _token
  lex()
***************
*** 303,309 ****
--- 326,336 ----
    int invert;
    _charset cset;
  
+ #ifdef sparc
+   FETCH0(c);
+ #else
    FETCH(c, (char *) 0);
+ #endif sparc
    switch (c)
      {
      case '^':
