Software For Full Text Information Retrieval
Home |
Rel is a suite of programs and tools for building wide area full text information retrieval systems. The search mechanisms are capable of sorting documents by relevance to keyword search criteria. Boolean operations (and, or, not, and grouping operators) on multiple keywords (which can be interior keywords) are fully supported, and the programs are capable of phonetic keyword search. The programs find application in enterprise wide area information retrieval (knowledge based) systems. This is the home page for the rel programs. The rel.tar.gz tape archive contains the C sources for the programs and the example applications described below.
Included in the sources is an example application of an organization/corporate wide knowledge based system that uses Procmail/SmartList to maintain corporate e-mail archives. (As an example, this system is used for the Stochastic UCE Detection, E-mail "Received: " Header IP Address Auditing, and Quarantine Attachments database maintenance.) Sophisticated queries can be submitted to the system using any mail user agent as the user interface. Documents are returned, in order of relevance, in standard digest (RFC 1153) format. Boolean operations (and, or, not, and grouping operators) on multiple keywords (which can be interior keywords) are fully supported, as is phonetic keyword search. The searching, digest formatting, and assemblage of the returned documents are controlled by a single SmartList script for flexibility and extensibility. The wgetrel suite of programs is an example application of the use of the htmlrel programs to control the direction of search across the Internet, as determined by the relevance of documents found at a site, and finds use in business intelligence operations. The suite uses the Wget program as a "page fetch engine" in a shell script for flexibility and extensibility. The rel program suite was first announced in 1994. There is additional information regarding the concepts of document relevance searching in the Architecture page, which also contains a suitable bibliography.

Historical Perspective

The concept of document relevance searching is not new. It was first proposed by Vannevar Bush (Bush, V. (1941), "Memorandum regarding Memex," [Vannevar Bush Papers, Library of Congress], Box 50, General Correspondence File, Eric Hodgins). There is a discussion of the Memex concept in the FAQs. The Rel suite was designed in 1993, following discussions:
in the Learning Organization (LO, Re: Peter M. Senge) and Business Process Reengineering (BPR, Re: Michael Hammer & James Champy) conference/mailing lists, which proposed an operative framework for knowledge based systems and competitive business intelligence. The design of the Rel suite was heavily influenced by the concepts of Vannevar Bush's Memex Machine. (The 1993-09-14 correspondence contains theoretical implications of using a Boolean not operator in search criteria, an implication that Bush was, intuitively, well aware of; see item 1996-10-22 in the FAQs page.) The design was also heavily influenced by the Qt text information retrieval system Unix Shell script from 1991.

Mathematical Analysis

Most full text information retrieval systems determine the relevance of documents using a word incident rate search criteria. This actually has formal merit: it is a heuristic implementation of Bayesian Inference, which is based on formal probability and combinatorial theory, and is often used in Natural Language processing. Let p(i) be the incident rate of the i'th search criteria word in a document, i.e., the fraction of the document's words that match that keyword. Then, by Bayes' Chain Rule for independent events, the probability that a document is relevant to a search criteria of n words is

    P = p(1) * p(2) * ... * p(n)

where the terms p(i) are the word incident rates. If there is no incidence of any one word in the search criteria, then the probability that the document is relevant is zero: essentially, a logical and operation, which has an intuitive esthetic appeal. But for search criteria word incident rates that are small, a reasonable assumption for most text, the product P is a vanishingly small number that is awkward to compute with. Taking the logarithm of P converts the product into a sum:

    log(P) = log(p(1)) + log(p(2)) + ... + log(p(n))
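By way of illustration, the following C fragment sketches this log-sum scoring. It is a minimal sketch under assumed inputs, and not the interface of the rel programs: the function name, the counts[] array of per-keyword incident counts, and the document word total are illustrative only. A document missing any search criteria word scores minus infinity, the implicit logical and; otherwise, the sum of the logarithms of the word incident rates orders the documents by relevance.

    /* A minimal sketch of word incident rate relevance scoring; the
       interface is hypothetical, not that of the rel programs.
       Compile with: cc -o score score.c -lm */

    #include <stdio.h>
    #include <math.h>

    /* Return the log-relevance of a document, or -HUGE_VAL if any
       search criteria word is absent (the implicit logical and);
       counts[i] is the incident count of the i'th search word in the
       document, n is the number of search words, and total is the
       document's word count. */
    double log_relevance(const int counts[], int n, int total)
    {
        double score = 0.0;
        int i;

        for (i = 0; i < n; i++) {
            if (counts[i] == 0)
                return -HUGE_VAL; /* a missing word makes P zero */
            /* the sum of logarithms replaces the numerically awkward
               product of small incident rates */
            score += log((double) counts[i] / (double) total);
        }
        return score; /* larger, i.e., closer to zero, is more relevant */
    }

    int main(void)
    {
        /* two hypothetical 1000 word documents, searched with a three
           word criteria */
        int doc_a[] = { 4, 2, 1 }; /* all three words present */
        int doc_b[] = { 9, 0, 3 }; /* the second word is absent */

        printf("document a: %g\n", log_relevance(doc_a, 3, 1000));
        printf("document b: %g\n", log_relevance(doc_b, 3, 1000));
        return (0);
    }

Sorting the documents by this score, in descending order, yields the order-of-relevance listing.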
Most full text information retrieval systems drop the formal requirements of working with word incident rates in documents, and simply use the word incident counts in a document, rejecting the document as non-relevant if any word in the search criteria is absent. As an aside, Bayes' Chain Rule expresses a general strategy for constructing search criteria in electronically automated document searching: choose words that will appear fairly uniquely in only relevant documents; and choose many of them. The first is intuitive; the significance of the second is often underestimated. As a closing aside regarding implementation, the standard usage of Bayes' Chain Rule is problematic as an algorithmic document search methodology where the word incident rates of most searches are a fraction of a percent per document. An alternative implementation is to use Bayes' Chain Rule to calculate false positive rates: for example, assuming the document to be relevant, and calculating the probability that the declaration is a false positive using the Chain Rule, where a smaller chance of a false positive is, intuitively, more relevant. There are more sophisticated alternatives, too, for example, using the word incident rate in a complete archive for the complement, instead of the incident rates per document. Document search algorithms are, indeed, a fine art.

Availability

The rel suite of programs is freely available for Download, distributed as source code at no charge, and provided under License.

Mailing list

There is a mailing list available for users of the rel program suite. To join, see the Mailing List page.

Archive

The NformatiX archive is at http://www.johncon.com/nformatix/archive/.