ftp.nice.ch/peanuts/GeneralData/Usenet/news/1989/CSN-89.tar.gz#/comp-sys-next/1989/Jan-Apr/NeXTs-Digital-Library



Date: Sun 27-Dec-1988 17:23:19
From: Unknown
Subject: NeXT's Digital Library

In article <19728@ames.arc.nasa.gov> mike@ames.arc.nasa.gov.UUCP (Mike Smithwick) writes:
>The Digital Librarian is impressive. We searched for the word "celestial"
>throughout the works of Shakespeare. It found all 3 entries in what appeared
>to be less than 5 seconds.

Huh? I'm mystified. What do you mean by ``all 3 entries''? Using the UNIX utility `grep' I found 17.

In general I'm very disappointed with the Digital Library, for many reasons detailed below. Notice that while some of the reasons appear to be bugs, and hence have a chance to go away, others seem to constitute ``features'', so they will probably stay. I start with the ``features''.

I'm thankful to Nick Katz (nmk@fine.princeton.edu) for pointing out some of the facts below and motivating me to make a more complete study.

UNDESIRABLE FEATURE #1:
You can only search for words, not for strings or phrases. This means that to find out where S. wrote ``To be or not to be'', you'd have to wade through thousands of occurrences of ``to'', ``be'', ``or'' or ``not''. But read on.

UNDESIRABLE FEATURE #2:
Apparently very common words cannot be used as search keys at all-- you get a ``0 found'' response. This is the case with the four words mentioned above. Together with feature #1, this means that the Digital Librarian simply won't locate S.'s most famous quotation.

UNDESIRABLE FEATURE #3:
The display of occurrences is done in two windows. The top window, a smaller one, consists of one line for each file where the word was found (each file has a scene of a play, or a sonnet, etc.). The line contains the file name (e.g. Coriolanus: 1.4) and the beginning of the file. The latter is completely useless information, as it usually consists of stage directions, etc. I would expect here a context line instead, including the keyword.
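As a rough illustration of the kind of context line asked for here, the following Python sketch prints one line per occurrence of a keyword, with the keyword in the middle and some text on either side. It is not how the Digital Librarian works; the function name `context_lines`, the `width` parameter, and the sample text (only the first five words of which come from this thread) are assumptions made for the example.

```python
import re

def context_lines(text, keyword, width=30):
    """Yield one keyword-in-context line per whole-word occurrence of keyword."""
    # \b keeps "horse" from matching inside "horses"; matching ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    # Collapse line breaks so the context can span verse lines.
    flat = " ".join(text.split())
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        yield f"{left:>{width}}[{m.group(0)}]{right}"

# Sample text: the quotation from this thread followed by placeholder words.
sample = "And our twelve thousand horse and two horses"
for line in context_lines(sample, "horse", width=12):
    print(line)
```

A reader can then scan the bracketed column by eye instead of opening each file in turn.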
To actually see the quotations you want, you select a line from the top window; the bottom window shows the corresponding file, centered around the first occurrence of the word in the file. The upshot is that to find a particular quotation, you have to click on every line of the first window to open the corresponding file, then click on ``Find'' before leaving that line (just in case the file contains more than one occurrence). Compare this with the system used in printed and on-line concordances, where you're presented with a list of context lines and can scan it visually for the quotation you're looking for.

UNDESIRABLE FEATURE #4:
The source text has very low-level formatting commands embedded in it. (Though I guess I should be thankful it's in ASCII files, not in binary files in some proprietary format...) For example, the beginning of <Shakespeare>/Plays/Hamlet/1.1 is something like this:

    ... {\pard\f0\fs28{\fs48 Hamlet\ }\ \ {\b\fs36 1.1} \ {\i Enter Barnardo [...] }{\b \fs24 BARNARDO} Who's there?\ ...

For this text to be used elsewhere than in an ``edit'' file, or even within an ``edit'' file but in a different format, you have to strip all this garbage. The markup should instead be done at a higher level, so global changes are easy to make. For example, using a TeX-like notation (that's what I'm most accustomed to; but SGML or any other markup language would do equally well):

    \title Hamlet \endtitle
    \scene 1.1 \endscene
    \dir Enter Barnardo [...] \enddir
    \speak BARNARDO \endspeak Who's there?

Now for the bugs:

BUG #1:
Not all occurrences of a word are found -- far from it. And generally you have no clue of that. I've already mentioned the ``celestial'' fiasco (3 found out of 17). If you try ``horse'' the situation is much the same: Nick Katz pointed out that if you search for ``horse'' you get (among others) a line saying ``And our twelve thousand horse'' (Ant. and Cl.: 3.7), but if you search for ``twelve'' you don't get this same line!
The most annoying thing is that the choice of quotations presented doesn't seem to be based on any clear criterion: the 14 ``celestial''s that didn't make it into the search seemed to have as much of a right to be there as the three that did!

BUG #2:
Treatment of plurals, etc. is inconsistent. E.g. searching for ``horse'' and ``horses'' brings up two disjoint sets of occurrences, but one of the occurrences listed under ``horse'' actually says ``horses'' (1 Henry IV: 2.4). In general there doesn't seem to be any way to search for words under a prefix (as there seems to be for Webster's, although that doesn't work all the time either -- but that's the subject of another message).

Silvio Levy (levy@princeton.edu)

>From: jgreely@diplodocus.cis.ohio-state.edu (J Greely)
Date: Sun 27-Dec-1988 21:39:19
From: Unknown
Subject: Re: NeXT's Digital Library

In article <5037@phoenix.Princeton.EDU> levy@Princeton.EDU (Silvio Levy) writes:
[in reply to "Mike of the silly return address" stating that he found "all 3 entries" of a word in the Librarian]
>Huh? I'm mystified. What do you mean by ``all 3 entries''? Using
>the UNIX utility `grep' I found 17.

Yes, boys and girls, the correct phrase is all *indexed* entries of a word. Actually, to be more precise, I should say, "all files for which a word is indexed", since the indexing is at the file level. Indexing in general is a very-beta operation, and the current scheme is listed in the release notes with:

    This set of tools is not supported. It will change between now and the 1.0 release, but it does give a flavor of things to come.

Since the indexing library is at the heart of the lookup problems, simply bear with it until it is replaced by a better scheme. Actually, what's there is very nice. The db library is dbm done right, and the idea behind pword is excellent (although its current reliance on modern english is unfortunate; this is one of the major reasons why the indexing in Shakespeare isn't as good as it could be). I have great hopes that db will eventually find its way out into the world (I'd love to work over everything around here that relies on dbm, and insert db instead. This would probably solve several of our problems with yp).

>You can only search for words, not for strings or phrases.
>This means if to find out where S. wrote ``To be or not to be'',
>you'd have to wade through thousands of occurrences of ``to'',
>``be'', ``or'' or ``not''. But read on.

This is a combination of things. Do you really want all occurrences of "to"? Quick check shows there to be more than 16000 of them, scattered throughout over 6000 files. Common noise words are eliminated from the index as a design decision.
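The two design points just described, indexing at the file level and dropping noise words, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the NeXT indexing library: the stop list `STOP_WORDS` is invented, the names `build_index` and `lookup` are hypothetical, and the sketch indexes every remaining word rather than a chosen subset. The file names and text fragments are taken from quotations that appear in this thread.

```python
import re
from collections import defaultdict

# Hypothetical stop list; the words actually dropped by the NeXT indexer
# are not given in this thread.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of", "and", "is", "that"}

def build_index(files):
    """Map each non-noise word to the set of file names that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                index[word].add(name)  # file-level: no position is recorded
    return dict(index)

def lookup(index, word):
    """Return the sorted file names for word, or an empty list if unindexed."""
    return sorted(index.get(word.lower(), set()))

files = {
    "Hamlet: 3.1": "To be, or not to be, that is the question: Whether 'tis nobler",
    "Merchant of Venice: 4.1": "Ready, so please your grace",
}
index = build_index(files)
print(lookup(index, "to"))      # a noise word was never indexed
print(lookup(index, "nobler"))  # an indexed word returns whole files
```

Because only file names are stored, a hit says which file to open but not where in the file the word occurs, and a query made up entirely of noise words can only come back empty.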
As for the inability to search for a phrase, this is acknowledged as a limitation in the release notes. Also, the above statement is not quite true. You can search for

    <word> ["and"|"or"|"and not" <word> ...]

which, if the words you want are indexed, will narrow the search for you. My stock example is locating the line "Ready, so please your grace" in The Merchant of Venice. Not a very important line, but it stuck in my memory from when we performed the play. The only word that is indexed is "grace", which occurs 75 times. The one (reasonable) search that will uniquely locate it is "merchant and grace" (Merchant of Venice, Act 4, Scene 1, second line).

>Apparently very common words cannot be used as search keys at all--
>you get a ``0 found'' response. This is the case with the four words
>mentioned above. Together with feature #1, this means that the Digital
>Librarian simply won't locate S.'s most famous quotation.

Correct. At present, that quote (as well as several others I've tried) cannot be found from within the Library as is. However, if you know any of the surrounding context, you're better off. I happen to remember that the line comes from Hamlet, and that the quote continues with "that is the question. Whether 'tis nobler...". Searching for "nobler" will return 19 files, while "hamlet and nobler" will return the correct section (Hamlet, Act 3, Scene 1). From there, a Find on "nobler" will put you at the correct location in the file.

Mind you, you'll never find "Now is the winter of our discontent made glorious summer by this son of York", unless you know that it's the first two lines of Richard III. Incidentally, Library reports this line as "...son of York", while Quotations claims that it's "...sun of York". Typo, anyone?

>UNDESIRABLE FEATURE #3: [indexing stores the first line(s) of the file, rather than the context of the match]

Agreed. The context would be more useful, but I don't think this will change.
The index is built at the file level, so all it knows is that the word is important enough to be indexed for that file. If it returned context, it would be the context of the first entry, and not necessarily the one you want.

>UNDESIRABLE FEATURE #4: [embedded rtf, rather than something brighter]

This looks like a feature, since low-level encoding requires less intelligence than full TeX-like macros. Not having any documentation on the Microsoft RTF format, I can't say whether it is capable of more sophisticated (read that, "higher level") formatting.

>Now for the bugs:
>
>BUG #1: [bug, feature, same difference]
>Not all occurrences of a word are found -- far from it.

I recommend to you the manual page for "pword". This will help clarify how the indexing is currently done. The object is to index all *significant* words, based on the surrounding context. A document with frequent mention of horses is more likely to have "horse" indexed than one where it's only mentioned once. Note that the documentation for pword is slightly out of date, and will hopefully be correct by 0.9 (for the correct options, use "pword

One other problem is picking Shakespeare for this discussion. The frequency tables used for the indexing appear to be the Modern English version, rather than one more appropriate for the work. In particular, the stop list does not include noise words like thee, thy, thou, etc., instead indexing them quite heavily ("thou" is indexed 315 times, for example).

>BUG #2:
>Treatment of plurals, etc. is inconsistent.

Words are "singularized", but no mention is made of the technique used. It is quite likely that the method currently used isn't as bright as one might hope.

Now, to toss in a few of my own (my complete list is a bit too large to post, so I'll limit myself to a few things you didn't mention about the Library):

1) A lower-case search string will perform a case-insensitive search, while an upper-case character will force an exact match.
Nice in theory, but it doesn't work. Searching (in Shakespeare) for "Merchant and grace" will return all 75 matches for "grace", while "merchant and grace" will return the unique match that I'm looking for.

2) There is no way to pull up an arbitrary file into the Library, except as the result of a search. For example, if Act 4, Scene 1 comes up as the result of a search, I cannot simply proceed to the Scene 2 if I wish to continue reading. I can open an Edit window containing it, but I can't pull it into the Library unless I can match it with a search. This is the most serious limitation of the program for me, and the one I most want to see changed by 1.0. I want to be able to browse through the files contained in the current database, without leaving the Library.

3) The target field is shared by the Search and Find buttons, but not by the Open button, which instead pulls up a Browser window. Better yet, Search understands multiword targets, while Find will attempt to match the literal string. So, I cannot click Search, and then expect Find to locate the search string within the selected file, unless the search string was a single word. The inconsistent use of the target field is confusing.

4) Printing is useless. An RTF document printed from the Library will have no margins, and will be silently clipped on the way to the printer. If you want to print, you currently have to call up Edit on the current file.

-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)
"Who is it *this* time?"
"Concert promoters who have gone broke organizing charity benefit concerts. We call it Aid Aid."

>From: ali@polya.Stanford.EDU (Ali T. Ozer)
Date: Sun 28-Dec-1988 18:12:53
From: Unknown
Subject: Re: NeXT's Digital Library

In article <5037@phoenix.Princeton.EDU> Silvio Levy writes:
>UNDESIRABLE FEATURE #4:
>The source text has very low-level formatting commands embedded
>in it. (Though I guess I should be thankful it's in ASCII files, not
>in binary files in some proprietary format...)

The formatting used for the Shakespeare files is the Microsoft Rich Text Format (usually known as RTF). The NextStep Text class understands RTF; and any program using the Text class should be able to read in and edit RTF files without a problem. (Currently the Text class cannot write out RTF; but it will be able to in 0.9.)

You can use Edit, the cut-and-paste editor in the Apps directory, for reading in RTF files and stripping the RTF info off. If you double-click on the file name in the Librarian, Edit will be launched and the specified file will be read into a new window, formatted correctly and with the various fonts as indicated by the RTF instructions.

If you wish to strip the RTF commands off, create a new Edit window, then cut the desired text from the first window and paste it into the second. Edit windows by default are mono-font, so the RTF info is automatically stripped during the paste. You can make an Edit window accept RTF by selecting "Make RTF" from the menu.

Ali Ozer, NeXT Developer Support
aozer@NeXT.com

>From: ken@gatech.edu (Ken Seefried iii)
Date: Sun 29-Dec-1988 04:06:38
From: Unknown
Subject: Re: NeXT's Digital Library

In article <5820@polya.Stanford.EDU> aozer@NeXT.com writes:
[strip RTF from files in the Library by launching Edit, and cut-and-pasting into a new edit window]

Well, this is useful, but for those of us who want to strip RTF from an arbitrary file without the overhead of starting Edit, the filter /bootdisk/NeXT/System/Searcher/rtf-ascii is more fun (not perfect, but more fun).

-=-
J Greely (jgreely@cis.ohio-state.edu; osu-cis!jgreely)
"Who is it *this* time?"
"Concert promoters who have gone broke organizing charity benefit concerts. We call it Aid Aid."

>From: jeff@stormy.atmos.washington.edu (Jeff L. Bowden)
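For readers without access to the rtf-ascii filter mentioned in the last post, here is a minimal Python sketch of what stripping simple RTF involves. It is not that filter and makes no claim about how it works. It assumes only the small subset of RTF seen in the Hamlet excerpt quoted earlier in the thread (control words such as \fs48, grouping braces, and backslash-newline as a paragraph break), and the function name `strip_rtf` is made up for the example. Font tables, \'hh escapes, and the rest of real RTF are ignored.

```python
import re

# Assumed, simplified token grammar for a small subset of RTF.
TOKEN = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"  # control word, optional number, optional delimiter space
    r"|\\([{}\\])"              # escaped literal brace or backslash
    r"|\\\n"                    # backslash-newline: paragraph break
    r"|([{}])"                  # group delimiters
    r"|([^\\{}]+)"              # run of plain text
)

def strip_rtf(source):
    """Return the plain text of a simple RTF fragment."""
    out = []
    for m in TOKEN.finditer(source):
        word, _arg, escaped, brace, text = m.groups()
        if word is not None:
            if word in ("par", "line"):
                out.append("\n")        # explicit break control words
        elif escaped is not None:
            out.append(escaped)         # keep the literal character
        elif brace is not None:
            continue                    # grouping carries no text
        elif text is not None:
            out.append(text.replace("\n", ""))  # raw newlines are not significant
        else:
            out.append("\n")            # the backslash-newline alternative matched
    return "".join(out)

# A fragment modelled on the excerpt in the thread; the line breaks are assumed.
sample = (
    "{\\pard\\f0\\fs28{\\fs48 Hamlet\\\n}\\\n"
    "{\\b\\fs36 1.1}\\\n"
    "{\\i Enter Barnardo}\\\n"
    "{\\b\\fs24 BARNARDO} Who's there?\\\n}"
)
print(strip_rtf(sample))
```

All formatting is thrown away, so the result says nothing about which words were titles, stage directions, or speaker names; that loss is the point made earlier in the thread about low-level markup.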

These are the contents of the former NiCE NeXT User Group NeXTSTEP/OpenStep software archive, currently hosted by Marcel Waldvogel and Netfuture.ch.