From: John Conover <john@email.johncon.com>
Subject: new program
Date: Thu, 5 Jan 95 01:05 PST
Hi Rick.

I'm working on a toy. It ran for the first time about 15 min. ago. One of the issues we had at S-MOS with full text info retrieval databases was that the query lang. was too sophisticated for most users (like executives!) So I wrote a unix filter that uses infix, so you can search for those documents that contain both word1 and word2 with a statement like "(word1 & word2)." And, not, and or are the supported booleans. Nice, simple TI calculator syntax.

The program does an infix to postfix stack translation, searches through a directory hierarchy (POSIX compliant) for documents (files) that contain the requested words using a Boyer-Moore-Horspool-Sunday search, and evaluates the postfix stack on the frequency count of the requested words in each document. A list of those files containing the words, with the frequency count, is made, then reverse sorted on the frequency using a linked list qsort, and this list is dumped to the screen in reverse order, so you can do something like more `rel "(abc & def)" mydir` and it will bring the files, in order of relevance, into the pager.

For example, I just searched the man directories on johncon, and tried to find ls(1) by searching for "(directory & listing)," as if I didn't know what the command was. There were 691 files searched, for a total of 6.2 Meg characters. It took 42 seconds at 58% as per the time command (johncon is a 20 MHz 386 with an 18 ms drive.) The resultant file list was 132 files, of which ls(1) was 5th from the top. So, I would have cut my literature search from 691 files to 5, or by about 99%. Kind of interesting.

I had to use double level indirection in qsorting the linked list of files, and under some circumstances a core comes falling out of the sky (I just finished writing it about 30 min. ago and haven't even debugged it), but other than that, it is probably a pretty good prototype.

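
For illustration, the pipeline described above (infix to postfix translation, word search, postfix evaluation on frequency counts, ranking) might be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the program itself. Not taken from the message: `|` and `!` as the symbols for or and not (only `&` is shown), the rule for combining counts (and: sum of both counts if both are nonzero, else 0; or: sum; not: 1 if the count is 0, else 0), plain Horspool in place of the Boyer-Moore-Horspool-Sunday search, documents held in an in-memory dict instead of a directory walk, Python's built-in sort in place of the linked list qsort, and all function names.

```python
import re

# Assumed operator symbols and precedence; the message only shows "&".
PRECEDENCE = {"!": 3, "&": 2, "|": 1}


def to_postfix(query):
    """Translate an infix query into a postfix token list (shunting-yard)."""
    tokens = re.findall(r"[()&|!]|[^\s()&|!]+", query)
    output, stack = [], []
    for tok in tokens:
        if tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("unbalanced ')'")
            stack.pop()  # discard the matching "("
        elif tok in PRECEDENCE:
            # Pop operators that bind tighter; "!" is unary, so it never pops "!".
            while (stack and stack[-1] != "(" and
                   (PRECEDENCE[stack[-1]] > PRECEDENCE[tok] or
                    (PRECEDENCE[stack[-1]] == PRECEDENCE[tok] and tok != "!"))):
                output.append(stack.pop())
            stack.append(tok)
        else:
            output.append(tok)  # a search word
    while stack:
        op = stack.pop()
        if op == "(":
            raise ValueError("unbalanced '('")
        output.append(op)
    return output


def count_occurrences(text, word):
    """Count occurrences of word in text with a Horspool-style skip table."""
    m, n = len(word), len(text)
    if m == 0 or n < m:
        return 0
    # Skip distance for each pattern character except the last one.
    shift = {ch: m - 1 - i for i, ch in enumerate(word[:-1])}
    count, pos = 0, 0
    while pos <= n - m:
        if text[pos:pos + m] == word:
            count += 1
        # Skip according to the text character under the pattern's last position.
        pos += shift.get(text[pos + m - 1], m)
    return count


def evaluate(postfix, text):
    """Evaluate the postfix query on word frequency counts (assumed rules)."""
    stack = []
    for tok in postfix:
        if tok == "!":
            stack.append(0 if stack.pop() else 1)
        elif tok in ("&", "|"):
            right, left = stack.pop(), stack.pop()
            if tok == "&":
                stack.append(left + right if left and right else 0)
            else:
                stack.append(left + right)
        else:
            stack.append(count_occurrences(text, tok))
    return stack.pop()


def rank(query, documents):
    """Return (score, name) pairs with nonzero score, highest score first."""
    postfix = to_postfix(query)
    scored = [(evaluate(postfix, text), name) for name, text in documents.items()]
    return sorted((s for s in scored if s[0] > 0), reverse=True)
```

Calling `rank("(directory & listing)", docs)`, with `docs` mapping file names to their text, returns only the files that contain both words, ordered by combined count.
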
The searching is faster than agrep when doing multiple word searches, and it will probably have adequate performance for databases of up to 10 Meg total. I'm writing it to replace wais on johncon, which I use for handling all of my email. Wais takes just too much maintenance on the inverted index database, and the language is too hard to use: no partial keys or grouping operators, and the database requires as much disk space as the documents, etc.

What do you think about the infix query heuristic?

John

--
John Conover, john@email.johncon.com, http://www.johncon.com/