From: John Conover <john@email.johncon.com>
Subject: Enterprise wide information retrieval system
Date: Fri, 19 May 95 01:28 PDT

Attached is a description of an enterprise-wide information retrieval and documentation system (e.g., a "memo machine," "information robot," "infobot," or "mailbot") that I have used with good results for many years. It has been used for context management of projects in geographically and ethnically dispersed organizations involved in the very dynamic electronic component market (the microprocessor market, to be more specific).

The "C" sources to the system are available at no charge by sending an email with the following format:

    To: info-request@email.johncon.com
    Subject: archive get rel

The sources constitute about 400K bytes, and will be returned to you, by return email, in two sections. (You will be mailing into the system described below, BTW; only in this case, it is functioning as a source repository, so the search functions are disabled.)

You may want to skip the 5 numbered paragraphs, below, to get to the description of the conceptual application of the system. There are several attachments that describe the technical evolution of the system. I think it might be interesting reading for those involved in modern organizational theory and informatics, particularly if the marketplace served is very dynamic (the revenues of the microprocessor industry have a time series that is fractal, BTW).

It should be apparent that this system is not meant to replace the traditional content database systems used in MIS organizations, but is complementary to them. The system is capable of operating in a distributed, interoperable client-server environment.

The second paragraph of numbered paragraph 5, below, concerning the concept of "moving orthogonally in information space," may be of interest to those involved in OD when doing organizational effectiveness evaluations.

john@email.johncon.com (John Conover)

--
John Conover, john@email.johncon.com, http://www.johncon.com/

______________________________________________________________________________

The objective of this application of the rel program, in conjunction with the procmail/smartlist programs, is to construct an enterprise-wide, full text information retrieval system that uses the Unix MTA (Message Transfer Agent) as a delivery, query, and distribution system - in a sense, what is currently termed "groupware."

In general, the operational procedure is as follows:

1) The information retrieval database is installed in the database administrator's account (perhaps a project team leader's, since installation does not require sysadm privileges), as per the instructions in the INSTALL file in this directory. Multiple information retrieval systems may be installed, if necessary.

2) All members (perhaps a project team) who will have access to the information system are entered into the accept and/or reject files, as appropriate, in the smartlist directory for the system's database, as per the standard installation in the procmail/smartlist manual. If the repository is, in addition, to function as a distribution agent of information, the dist file in the smartlist directory should include the email addresses of those on the distribution list, as per the standard installation outlined in the manual.

3) As formal email correspondence occurs between the members of the project team, any mail that is sent to the information system's account will be saved and placed under maintenance of the full text database system. Anything pertaining to the project should be submitted to the database system. If the system functions as a distribution agent, all email submitted will also be mailed to the distribution list, which could be other team members, management, etc.
These, in turn, can be replied to, and the replies will be placed under the maintenance of the database system and distributed to all others on the distribution list, i.e., it is an asynchronous conferencing system, or an electronic "mailing list."

4) However, it differs from conventional electronic mailing lists in that it has a "sister account," which can receive mail and can be used to query the database for previous email concerning issues that have been addressed (or, perhaps, not addressed). The program rel is used for the query, so a "context can be framed" using the rel search criteria. Any email found that concerns the "framed context" is returned to the user via email, as a MIME compliant digest. The emails in the digest are in order of relevance to the "context" of the query. In some sense, the digest is like an electronic administrative folder that is maintained with correspondence concerning specific issues. The major difference is that the folder can be created dynamically, on demand.

5) It turns out that, with a modern MUI (Mail User Interface) like GNU Emacs' rmail facility, very powerful informatic operations can be performed. For example, the email digest returned from the operation in 4), above, is a MIME digest, in order of relevance to a "context" query. The digest can be "burst" into individual messages and, starting with the most relevant message, the messages can be browsed to find a specific entry, or the context of a group of messages. If a specific message is of interest, it can be replied to, to re-table an issue, or request additional information, etc.

There is one other very powerful operation that can be performed. Suppose that, when examining the messages in order of relevance, some interest is generated in one of the messages, and it is desired to find out more about the "context" of the message.
The messages can be rearranged, and ordered by author, to further find out the "context" of the messages from the author's point of view. The messages can also be rearranged by date, so that a short "context" window can be derived around the temporal issues of the message. These operations are termed "moving orthogonally in information space," e.g., you were moving through the documents in order of relevance, then investigated the concerns of a specific author, then investigated the author's concerns in relation to the rest of the group over a period of time, etc.

Note that the process outlined in sections 3), 4), and 5), above, really constitutes nothing more than an electronic literature search, and is a similar process to what a historian would do when researching a subject (presumably because an understanding of the "context" of the subject was desired). The only difference is that the process is highly automated.

The proposed system can be thought of as an electronic filing cabinet that can be searched electronically. The concept is not new - it was proposed by Vannevar Bush in the 1940's (the Memex Machine), and later modified by Douglas Engelbart in the early 1970's.

Note that what is being proposed is a new administrative paradigm - one that addresses context, as opposed to traditional content issues in organizations, and is compatible with the contemporary concepts of "Empowerment" and "Total Quality Management." Administration is the mechanization of the flow of information through an organization. What is being proposed is to use an "information machine," i.e., a computer, to search, collate, and distribute information. In some sense, it is a "memo machine" that can transcend organizational and parochial boundaries. Note that the memos do not have to be structured - how information is structured will be specified at query time, not at the time of composition of the memo.
(Note that it is not a "hypertext" system, since the "links" are constructed dynamically at the time of the query, not at the time a document is composed.)

A key issue in automating the process is the capability for the uninitiated to create "context" queries that are representative of the information desired. The rel syntax is powerful, intuitive, and easy to use on an operational basis; most managers already understand how to use email, and the rel query syntax is similar to the one used in algebraic calculators. In point of fact, it is identical to the syntax used in calculators from Texas Instruments and Casio, except that numbers are replaced by words, and mathematical operators are replaced by boolean operators - a shorthand natural language query.

As a concluding remark, note that the system cannot be a substitute for good organizational practices and disciplines; as Doug Engelbart stated, "if you automate a big mess, you end up with a very fast big mess."

There are 3 attachments. The first is a brief introduction of how the system was used in an experimental program, the second is excerpts from the original system development reports (the system outlined above is far simpler than the original), and the third is an excerpt from the rel program manual.

______________________________________________________________________________

Attachment 1:

______________________________________________________________________________

Attached is a brief synopsis of an asynchronous conferencing system (also known as an information retrieval system, electronic literature search system, or corporate repository) that I used in cross functional program management, in another life, a long time ago. The objective was to find a methodology to relate the corporate information repository to the management structure (we did not consider the technical issues to be significant).
The general concept was to add sufficient functionality to the Unix email system to turn it into an electronic literature search system. The attached is a "cut and stick" from some of the reports on the system's development. The project/program team supported by this system consisted of a little over a hundred professionals, from approximately 20 specialties and 4 core corporate functions. They were geographically and ethnically dispersed.

Also attached is the description of a program, rel.c, which was used to perform complex literature search queries on the full text database, and return documents in order of relevance to the query (note that hypertext methodologies are incapable of operating in this fashion). These documents were returned as an email digest, which could be "burst" into constituent documents, allowing the most relevant document to be reviewed first (and then, if necessary, the specific email could be responded to, for further clarification, etc.). Since most email readers (elm, pine, email, etc.) are capable of sorting a mail box by different criteria, one could move "orthogonally in information space" during the review process (i.e., move through the documents by relevance, and then sort by author, to find out what he/she had to say about things, then sort by date, to find out the chronology of events in the discussion, etc.).

The database system was a distributed environment, with each segment of the database consisting of less than 10 Mbyte, so queries were done in parallel, using all network resources, and were thus very fast.
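
The review procedure described above (burst the digest, then re-sort the messages by author or by date) can be illustrated with a short sketch. It is written with Python's standard email package and is not part of the original system; the function names burst, by_author, and by_date are hypothetical, and the sketch assumes that every part of the digest is a message/rfc822 part and that every Date header is well formed and carries a time zone.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime


def burst(digest_text):
    """Split a MIME multipart/digest into its enclosed messages.

    The order of the digest (order of relevance, in the system
    described above) is preserved.
    """
    digest = message_from_string(digest_text)
    if not digest.is_multipart():
        return []
    return [
        part.get_payload(0)  # the enclosed message itself
        for part in digest.get_payload()
        if part.get_content_type() == "message/rfc822"
    ]


def by_author(messages):
    """Stable sort by sender address; ties keep their relevance order."""
    return sorted(messages,
                  key=lambda m: parseaddr(m.get("From", ""))[1].lower())


def by_date(messages):
    """Sort chronologically (assumes well-formed Date headers)."""
    return sorted(messages, key=lambda m: parsedate_to_datetime(m["Date"]))
```

Because Python's sort is stable, sorting by author keeps each author's messages in their original order of relevance, which is one way to read "moving orthogonally": the second ordering is applied on top of the first rather than replacing it.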

In point of fact, all of the attachments are a "cut and stick" from documents fetched from the full text database system with the command:

    rel ((information & retrieval) | (literature & search) | (corporate & repository) & management)

______________________________________________________________________________

Attachment 2:

______________________________________________________________________________

Various "cut and sticks" from the development reports.

Information systems are used in program management, which must coordinate the various activities of the corporate functions (i.e., engineering, marketing, sales, etc.) involved in development projects. After researching the issues (see below), we concluded that a distributed full text system that uses the mail (MTA) system as a communication medium is the desirable direction to pursue. Our reasoning is as follows:

1) The Unix MTA is almost universal, and will operate effectively over uucp and/or ethernet connectivities in a non-homogeneous hardware environment.
2) Each transaction is logged, with a date/time stamp, and who created the transaction.
3) The MTA already has rudimentary file storage capabilities, which can be used to query/respond to transactions at a later date.
4) Most(?) computers are already connected together, and users are familiar with how to use the system.
5) The MTA database can be NFS'ed to conserve machine resources.
6) It is a text based system.

We discounted the "hypertext" type of systems, because the links must be established before the document is stored, which is fine if you know what you are going to query for. In a general management application, this is seldom the case.

We set up a prototype system, using the following (readily available) programs:

1) elm, because it has a slightly more sophisticated file storage structure, and a very powerful aliasing capability that can alias team members as a group.
Additionally, it has limited query capabilities, and can, through its forms capabilities, send mail transactions in a structured format (which is advantageous if the transactions are used for notification of schedule milestone completion, etc.). Eudora was used on the PCs and Macs, using POP3 as the communications environment between the PCs and the Unix MTA.

2) The dbm library, to build an extensible hash query system into the file storage structure made by elm. This was operated in two ways: by an RPC direct call, and by a mail daemon that "read" incoming mail (to a query "account") and returned (via mail) all transactions that satisfied boolean conditionals on requested words. (A data dictionary was added later, so that the dictionary could be scanned for matches to regular expressions, which were then passed to the extensible hash system, but for some reason, this was seldom used.) The query was made through a very simple natural language interface, i.e., "send john and c.*r not January" would return all transactions containing "john" and a word matching "c.*r", excepting those written in January. (We did not attempt phrases; it looked complicated, and this is ill advised by Tenopir, etc., below.) This program contained approximately 350 lines of C code. A soundex algorithm was added later to overcome spelling errors: the full text database contained the soundex of the words in a document, and any words searched for were converted to soundex prior to the query. (See the works by Knuth for details of the soundex algorithm.) Also, a parser was added so that the boolean search words could be grouped in infix expressions, e.g., ((john & conover) ! (January | march)). Documents were returned in order of relevance.

This prototype was well received, and was used as follows:

1) Management "decreed" that the system would be used as a management tool, and all data had to be entered, or transcribed, into the system (including the minutes of meetings, etc.).
If it didn't exist in the system, it did not exist. All discussions, and reasons for decisions, had to be placed in the system. ALL team members and upper management had identical access to ALL transactions. (Mail could be used for private correspondence, such as politicking, etc., but all decisions, and the reasons for the decisions, had to be placed in the system.) The guiding rule was that, at the end of the project, the system would contain a complete play-by-play chronology and history of all decisions and reasoning concerning the project, and, by the way, who was responsible for the decisions.

Each Monday, everyone entered his/her objectives for the week into the system, and when each objective was finished, she/he mailed the milestone into the system, i.e., all group members and management could thus find out the exact status of the project at any time (a "social contract" was made with management and the rest of the members of the team). In some sense, it is really nothing more than an automated, real-time MBO system. At any time, a discussion could be initiated on problems/decisions in the system by anyone. The project manager was assigned the responsibility of "moderator," or chairperson, for his/her section of the project.

Each Friday, the system was queried for project status, and the status was piped to TeX for formatting and printed for official documentation. This document was discussed at a late Friday people-to-people staff meeting. (The reason for setting things up this way can be found in Davidow, below.)

2) Marketing was responsible for acquiring all market data on magnetic media (from services like Dataquest, the Department of Commerce, etc.), and each document was "mailed" into the system so that the information was available for retrieval by anyone. All had access to the progress made by engineering, and could contribute information on issues as the program developed, i.e., this was a "concurrent engineering" environment.
3) Engineering was responsible for maintaining schedules, and reflecting those schedules in the system. If slippages occurred, the situation could be addressed immediately by management, and a suitable cross functional resolution could be arrived at.

4) Sales was responsible for adding customer inputs concerning the project into the system, so customer definitions could be retrieved by all project members. This included customer data, such as who has buying authority in the customer's organization, who has signature, etc.

The results were very impressive, not only by productivity standards, but also by "correctness to fit and form" standards (i.e., the right product was in the market at the right time, the first time). This has become a central agenda, as outlined in Davidow, below.

Bibliography:

"Computer-Supported Cooperative Work," Irene Greif
"A Model for Distributed Campus Computing," George A. Champine
"Enterprise Networking," Ray Grenier and George Metes
"Connections," Lee Sproull and Sara Kiesler
"5th Generation Management," Charles M. Savage
"Intellectual Teamwork," Jolene Galegher, Robert E. Kraut and Carmen Egido
"In the Age of the Smart Machine," Shoshana Zuboff
"The Virtual Corporation," William H. Davidow and Michael S. Malone
"Accelerating Innovation," Marvin L. Patterson
"Paradigm Shift," Don Tapscott and Art Caston
"Developing Products in Half the Time," Preston G. Smith and Donald G. Reinertsen
"Full Text Databases," Carol Tenopir and Jung Soon Ro
"Text and Context," Susan Jones

______________________________________________________________________________

Attachment 3:

______________________________________________________________________________

Rel is a program that determines the relevance of text documents to a set of keywords expressed in boolean infix notation. The list of file names that are relevant is printed to the standard output, in order of relevance.
For example, the command:

    rel "(directory & listing)" /usr/share/man/cat1

(i.e., find the relevance of all files that contain both of the words "directory" and "listing" in the catman directory) will list 21 files, out of the 782 catman files, of which "ls.1" is the fifth most relevant, meaning that to find the command that lists directories in a Unix system, the "literature search" was cut from 359 to 5 files, a reduction of approximately 98%. The command took 1 minute and 26 seconds to execute on a System V, rel. 4.2 machine (20 MHz 386 with an 18 ms ESDI drive), which is a considerable expediency in relation to browsing through the files in the directory, since ls.1 is the 359th file in the directory. Although this example is elementary, a similar expediency can be demonstrated in searching for documents in email repositories and text archives.

General description of the program:

This program is an experiment to evaluate using infix boolean operations as a heuristic to determine the relevance of text files in electronic literature searches. The operators supported are "&" for logical "and," "|" for logical "or," and "!" for logical "not." Parentheses are used as grouping operators, and "partial key" searches are fully supported (meaning that the words can be abbreviated). For example, the command:

    rel "(((these & those) | (them & us)) ! we)" file1 file2 ...

would print a list of filenames that contain either the words "these" and "those", or "them" and "us", but do not contain the word "we", from the list of filenames, file1, file2, ... The order of the printed file names is in order of relevance, where relevance is determined by the number of incidences of the words "these", "those", "them", and "us" in each file. The general concept is to "narrow down" the number of files to be browsed when doing electronic literature searches for specific words and phrases in a group of files, using a command similar to:

    more `rel "(((these & those) | (them & us)) ! we)" file1 file2`

Although regular expressions were supported in the prototype versions of the program, the capability was removed in the release versions for reasons of syntactical formality. For example, the command:

    rel "((john & conover) & (joh.*over))" files

has a logical contradiction, since the first group specifies all files which contain "john" anyplace and "conover" anyplace in files, and the second grouping specifies all files that contain "john" followed by "conover". If the last group of operators takes precedence, the first is redundant. Additionally, it is not clear whether wild card expressions should span the scope of multiple records in a literature search (which the first group of operators in this example does), or exactly what a wild card expression that spans multiple records means, i.e., how many records are to be spanned, without writing a string of EOL's in the infix expression. Since the two groups of operators in this example are very close, operationally (at least for practical purposes), it was decided that support of regular expressions should be abandoned, and such operations left to the grep(1) suite.

Applicability:

Applicability of rel varies with the complexity of the search, size of the database, speed of the host environment, etc.; however, as some general guidelines:

1) For text files with a total size of less than 5 MB, rel and standard egrep(1) queries of the text files will probably prove adequate.

2) For text files with a total size of 5 MB to 50 MB, qt seems adequate for most queries. The significant issue is that, although the retrieval execution times are probably adequate with qt, the database write times are not impressive. Qt is listed in "Related information retrieval software:," below.

3) For text files with a total size that is larger than 50 MB, or where concurrency is an issue, it would be appropriate to consider one of the other alternatives listed in "Related information retrieval software:," below.
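
The relevance heuristic outlined in the general description above can be sketched in a few lines. The following is only an illustration in Python, not the algorithm in rel.c: the names score and rank are hypothetical, "partial key" is assumed to mean prefix matching, "!" is treated as a binary "and not," the operator precedence ("!" before "&" before "|") is an assumption that fully parenthesized queries avoid, and the score is simply the number of incidences of the matched positive words.

```python
import re

OPERATORS = ("&", "|", "!", "(", ")")


def score(query, text):
    """Return a relevance score for text; 0 means "not relevant"."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    tokens = re.findall(r"[()&|!]|[^\s()&|!]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def primary():
        token = take()
        if token == "(":
            value = or_expr()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token is None or token in OPERATORS:
            raise ValueError("expected a word or '('")
        key = token.lower()
        # Partial key (assumed): count every word that begins with the key.
        return sum(1 for word in words if word.startswith(key))

    def not_expr():
        value = primary()
        while peek() == "!":
            take()
            if primary() > 0:   # the excluded term is present
                value = 0
        return value

    def and_expr():
        value = not_expr()
        while peek() == "&":
            take()
            right = not_expr()
            value = value + right if value and right else 0
        return value

    def or_expr():
        value = and_expr()
        while peek() == "|":
            take()
            value += and_expr()
        return value

    result = or_expr()
    if pos != len(tokens):
        raise ValueError("unexpected text after the expression")
    return result


def rank(query, documents):
    """Return the names of relevant documents, most relevant first."""
    scored = [(score(query, text), name) for name, text in documents.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for value, name in ordered if value > 0]
```

Here documents is a dictionary that maps a file name to its text, so rank plays the role of the list of file names that rel prints; documents that score zero are omitted, and ties are broken alphabetically by name.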

References:

1) "Information Retrieval, Data Structures & Algorithms," William B. Frakes, Ricardo Baeza-Yates, Editors, Prentice Hall, Englewood Cliffs, New Jersey 07632, 1992, ISBN 0-13-463837-9. The sources for many of the algorithms presented in 1) are available by ftp, ftp.vt.edu:/pub/reuse/ircode.tar.Z
2) "Text Information Retrieval Systems," Charles T. Meadow, Academic Press, Inc, San Diego, 1992, ISBN 0-12-487410-X.
3) "Full Text Databases," Carol Tenopir, Jung Soon Ro, Greenwood Press, New York, 1990, ISBN 0-313-26303-5.
4) "Text and Context, Document Processing and Storage," Susan Jones, Springer-Verlag, New York, 1991, ISBN 0-387-19604-8.
5) ftp think.com:/wais/wais-corporate-paper.text
6) ftp cs.toronto.edu:/pub/lq-text.README.1.10

Related information retrieval software:

1) Wais, available by ftp, think.com:/wais/wais-8-b5.1.tar.Z.
2) Lq-text, available by ftp, cs.toronto.edu:/pub/lq-text1.10.tar.Z.
3) Qt, available by ftp, ftp.uu.net:/usenet/comp.sources/unix/volume27.

______________________________________________________________________________