new program

From: John Conover <john@email.johncon.com>
Subject: new program
Date: Thu, 5 Jan 95 01:05 PST

Hi Rick. I'm working on a toy. Ran for the first time about 15 min.
ago. One of the issues we had at S-MOS with full text info retrieval
databases was that the query lang. was too sophisticated for most
users (like executives!) So I wrote a unix filter that uses infix, so
you can do things like search for those documents that contain both
word1 and word2 with a statement like "(word1 & word2)."  And, or,
and not are the supported booleans. Nice, simple TI calculator syntax.
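
To make the query syntax concrete, here is a minimal sketch in Python of turning an infix query into a postfix list with the usual operator-stack method. It is not the program's actual code. Only "&" appears in this message; the "|" and "!" symbols for or and not, the precedence order, and the names tokenize and to_postfix are assumptions made for illustration.

```python
import re

# Assumed operator symbols and precedence: only "&" is shown in the message;
# "|" (or) and "!" (not) are guesses made for this sketch.
PRECEDENCE = {"!": 3, "&": 2, "|": 1}

def tokenize(query):
    """Split a query into words, operators, and parentheses."""
    return re.findall(r"[()&|!]|[^\s()&|!]+", query)

def to_postfix(query):
    """Convert an infix boolean query to a postfix token list."""
    output, stack = [], []
    for tok in tokenize(query):
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            # Pop operators back to the matching "(".
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced parentheses")
            stack.pop()
        elif tok in PRECEDENCE:
            # Prefix "!" is pushed directly; binary operators first pop
            # anything on the stack that binds at least as tightly.
            if tok != "!":
                while (stack and stack[-1] != "("
                       and PRECEDENCE[stack[-1]] >= PRECEDENCE[tok]):
                    output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # a search word
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced parentheses")
        output.append(op)
    return output
```

With these assumptions, to_postfix("(word1 & word2)") gives the list word1, word2, & and the postfix form can then be evaluated once per document with a simple stack.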

The program does an infix to postfix stack translation, then searches
through a directory hierarchy (POSIX compliant) for documents (files)
that contain the requested words using a Boyer-Moore-Horspool-Sunday
search, evaluating the postfix stack on the frequency count of the
requested words in each document. A list of those files containing the
words, with the frequency count, is made, then reverse sorted using a
linked list qsort on the frequency, and this list is dumped to the
screen in reverse order, so you can do something like more `rel "(abc
& def)" mydir` and it will bring the files, in order of relevance,
into the pager.
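
The search-and-rank steps above can be sketched roughly as follows, in Python rather than whatever the program itself is written in. Several things here are assumptions, not the program's actual behavior: the search uses only the plain Horspool bad-character shift (the Sunday refinement is left out), a word counts as present when its frequency is above zero, a document's score is the sum of the word frequencies, and documents are passed in as an in-memory dict instead of being read from a directory tree. The names horspool_count, matches, and rank are hypothetical.

```python
OPERATORS = {"!", "&", "|"}  # assumed symbols; only "&" is in the message

def horspool_count(text, pattern):
    """Count occurrences of pattern in text using Horspool's shift table."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return 0
    # Shift for each character of the pattern except the last one.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    count, pos = 0, 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            count += 1
        # Shift by the text character aligned with the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return count

def matches(postfix, counts):
    """Evaluate a postfix query; a word is true if its count is nonzero."""
    stack = []
    for tok in postfix:
        if tok == "!":
            stack.append(not stack.pop())
        elif tok in ("&", "|"):
            right, left = stack.pop(), stack.pop()
            stack.append((left and right) if tok == "&" else (left or right))
        else:
            stack.append(counts[tok] > 0)
    return stack.pop()

def rank(postfix, documents):
    """Return (score, name) pairs for matching documents, best first."""
    words = {tok for tok in postfix if tok not in OPERATORS}
    hits = []
    for name, text in documents.items():
        counts = {word: horspool_count(text, word) for word in words}
        if matches(postfix, counts):
            # Assumed scoring: total frequency of all query words.
            hits.append((sum(counts.values()), name))
    hits.sort(reverse=True)
    return hits
```

For a real directory hierarchy, the documents dict could be filled by walking the tree with os.walk and reading each file, at the cost of holding every file in memory at once.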

For example, I just searched the man directories in johncon, and tried
to find ls(1) by searching for "(directory & listing)," like I didn't
know what the command was. There were 691 files searched, for a total
of 6.2Meg characters. It took 42 seconds at 58% CPU as per the time
command (johncon is a 20 MHz 386 with an 18 ms drive.) The resultant
file list was 132 files, of which ls(1) was 5th from the top. So, I
would have cut my literature search from 691 files to 5, or by about
99%.  Kind of interesting.

I had to use double level indirection in qsorting the linked list of
files, and under some circumstances a core comes falling out of the
sky (I just finished writing it about 30 min. ago; I didn't even
debug it), but other than that, it is probably a pretty good
prototype. The searching is faster than agrep when doing multiple word
searches, and it will probably have adequate performance for databases
with up to 10Meg, total.
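
The message does not say what the double level indirection looks like. One common reading is an array of pointers to the list nodes that gets sorted and then used to relink the list. Here is a rough Python analogue of that idea, using a list of node references in place of the pointer array. The FileNode class and sort_by_count function are hypothetical names, and sorting from highest count to lowest is an assumption.

```python
class FileNode:
    """One entry in a singly linked list of files and their word counts."""
    def __init__(self, name, count, next=None):
        self.name = name
        self.count = count
        self.next = next

def sort_by_count(head):
    """Sort a linked list by count, highest first, and return the new head."""
    # Collect references to every node (the stand-in for an array of pointers).
    nodes = []
    node = head
    while node is not None:
        nodes.append(node)
        node = node.next
    # Sort the references, not the nodes themselves.
    nodes.sort(key=lambda n: n.count, reverse=True)
    # Relink the nodes in sorted order.
    for current, following in zip(nodes, nodes[1:]):
        current.next = following
    if nodes:
        nodes[-1].next = None
    return nodes[0] if nodes else None
```

Because only references are moved, the file names and counts stay where they are; the final loop is the one place where a stale next link could be left behind, which is why the last node is terminated explicitly.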

I'm writing it to replace wais in johncon which I use for handling all
of my email. Just too much maintenance on the inverted index database
and too hard to use the language: no partial keys, no grouping
operators, and the database requires as much disk space as the documents,
etc. What do you think about the infix query heuristic?

        John

--

John Conover, john@email.johncon.com, http://www.johncon.com/
Last modified: Fri Mar 26 18:57:58 PST 1999 $Id: 950116012211.2166.html,v 1.0 2001/11/17 23:05:50 conover Exp $