NformatiX: The Htmlrels Program

Software For Full Text Information Retrieval:

The Htmlrels Program - Order The Relevance Of HTML Text Documents To A Phonetic Search Criteria

htmlrelx [-n N] [-v] patterns paths ...

DESCRIPTION

Htmlrelx is a modification of the rel(1) program: the output file list format has been altered to be compatible with Netscape's level 1 bookmark file syntax. It was designed to be used with the wget(1) program, which is freely available from GNU/FSF by anonymous ftp at ftp://prep.ai.mit.edu/pub/gnu/. A typical usage of wget(1) would be:


  wget -nc -r -l2 -H -x -t3 -T15 -P www -Ahtml,htm http://my.favorite.url

This creates a directory tree, www, on the local machine containing the available HTML documents starting at the specified URL, together with the documents they link to, followed two levels deep.
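As a reading aid, the same command can be written with long options. This is only a sketch: the long option names are those of current GNU wget and are assumed to correspond to the short flags used above, and the function name mirror_html is hypothetical (it is not part of wget or htmlrelx).

```shell
#!/bin/sh
# mirror_html URL  (hypothetical helper, not part of wget or htmlrelx)
# Fetch URL and the HTML pages it links to, two levels deep, into ./www.
#
#   --no-clobber         (-nc)   do not re-fetch files that already exist
#   --recursive          (-r)    follow links found in fetched pages
#   --level=2            (-l2)   follow links at most two levels deep
#   --span-hosts         (-H)    allow links that lead to other hosts
#   --force-directories  (-x)    keep a host/path directory hierarchy
#   --tries=3            (-t3)   retry each document up to three times
#   --timeout=15         (-T15)  give up on a stalled transfer after 15 s
#   --directory-prefix   (-P)    store everything under the www directory
#   --accept=html,htm    (-A)    keep only files with these suffixes
mirror_html() {
    wget --no-clobber \
         --recursive --level=2 \
         --span-hosts \
         --force-directories \
         --tries=3 --timeout=15 \
         --directory-prefix=www \
         --accept=html,htm \
         "$1"
}
```

For example, mirror_html http://my.favorite.url is intended to behave like the short-option command above.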

The wget(1) program can also fetch the pages listed in a bookmark file, www.html, together with the documents they link to, again two levels deep:


  wget -nc -r -l2 -H -x -t3 -T15 -P www -Ahtml,htm -i www.html

Run from inside the www directory, the htmlrelx program can then be used to make a bookmark file of the HTML documents, arranged in order of relevance to the specified criteria:


  htmlrelx criteria * > ../www.html

or, if only the 10 most relevant documents are desired:


  htmlrelx -n 10 criteria * > ../www.html

The documents can be reviewed with Netscape, Mosaic, or Lynx, for example:


  lynx www.html
  Mosaic www.html
  Netscape www.html

SEARCH STRATEGIES

The shell script wgetrels(1) uses the wget(1) program in conjunction with htmlrelx(1) to provide a flexible and extensible search tool for HTML Web pages on the Internet. One of the advantages of relevance searching on the Internet is that the search of HTML links can be controlled by the relevance of the information contained in the HTML pages. This can be an iterative process, for example:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm http://my.favorite.url

would "seed" the HTML page directory, www, with pages from the URL, http://my.favorite.url. Note that the choice of starting URL already specifies a search "context." For example, a search on game theory made by giving the keyword "game" to an Internet search engine would not produce the desired results. However, if the URL, http://my.favorite.url, were the Web pages of an economics department at a university, the "context" would be entirely different.

The next iteration of the search, going down another level in the hierarchy of links, might be:


  cd www
  htmlrelx criteria * > ../www.html
  cd ..

and the search iterated:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm -i www.html

where the file www.html is a list of URLs, in order of relevance to the criteria arguments given to htmlrelx(1). Since the URLs are ordered by relevance, the file www.html can be trimmed to the most "promising" ones (i.e., the documents with the best probability of containing the information being searched for), say, to 10 URLs:


  cd www
  htmlrelx -n 10 criteria * > ../www.html
  cd ..

and the search iterated:


  wget -nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm -i www.html

which would descend the search another level in the link hierarchy from http://my.favorite.url.
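The iteration above can be sketched as a small shell function. This is a minimal illustration, not the wgetrels(1) script itself: the names fetch, rank, search_iterate, WGET_OPTS, and DRY_RUN are hypothetical, and wget and htmlrelx are assumed to be on the PATH. With DRY_RUN=1 the commands are only printed, which makes it possible to inspect a planned search before using the network.

```shell
#!/bin/sh
# Hypothetical helpers; not part of the htmlrelx distribution.
# The wget options are the ones used in the examples above.
WGET_OPTS="-nc -r -l1 -H -x -t3 -T15 -P www -Ahtml,htm"

# fetch ARGS...: run wget with WGET_OPTS, or print the command if DRY_RUN=1.
fetch() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "wget $WGET_OPTS $*"
    else
        # WGET_OPTS is deliberately unquoted so that it splits into words.
        wget $WGET_OPTS "$@"
    fi
}

# rank KEEP CRITERIA: rewrite www.html with the KEEP most relevant pages.
rank() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "cd www; htmlrelx -n $1 $2 * > ../www.html; cd .."
    else
        # The subshell keeps the caller's working directory unchanged.
        (cd www && htmlrelx -n "$1" "$2" * > ../www.html)
    fi
}

# search_iterate URL CRITERIA ROUNDS KEEP
search_iterate() {
    fetch "$1"                 # seed the www directory from the start URL
    round=0
    while [ "$round" -lt "$3" ]; do
        rank "$4" "$2"         # order pages by relevance, keep the top KEEP
        fetch -i www.html      # follow the retained links one level deeper
        round=$((round + 1))
    done
}
```

For example, DRY_RUN=1 search_iterate http://my.favorite.url criteria 2 10 prints the seed command followed by two rank-and-fetch rounds. The criteria can be changed between rounds by calling rank and fetch directly.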

Alternatively, the file www.html can be edited and reordered with any popular browser, interactively between searches, to steer the direction of the search. Note that the search criteria can be altered in the process and that the Web pages, since they are stored on the local machine, can be viewed "off line." Note, also, that the programs wget(1) and htmlrelx(1) are "portable," so the actual search can be run on a host that has a direct high speed connection to the Internet, and the file www.html transferred back to the local machine.

One of the issues in searching the Internet is that the number of HTTP links that need to be searched grows exponentially with the depth of the search. If the number of pages in the directory www is increasing exponentially, it is probably appropriate to constrain the search by altering the search criteria used for htmlrelx(1). (There are about three links, on average, on every HTML page.)
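The growth can be estimated with shell arithmetic. The sketch below assumes, as a simplification, that every page carries the same number of links and that every link leads to a page not yet seen, so the result is an upper bound rather than a prediction. The function name pages_within_depth is hypothetical.

```shell
#!/bin/sh
# pages_within_depth LINKS DEPTH  (hypothetical helper)
# Print 1 + LINKS + LINKS^2 + ... + LINKS^DEPTH, the number of pages
# reachable from one start page if every page has LINKS unseen links.
pages_within_depth() {
    links=$1
    depth=$2
    total=1        # the start page itself
    level=1        # number of pages at the current depth
    d=0
    while [ "$d" -lt "$depth" ]; do
        level=$((level * links))
        total=$((total + level))
        d=$((d + 1))
    done
    echo "$total"
}
```

With three links per page, pages_within_depth 3 3 prints 40, and pages_within_depth 3 5 prints 364.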

For exhaustive searches, the depth (the -l argument to both wget(1) and wgetrels(1)) can be increased. For general searching, a depth of 3 will usually suffice, and only one iteration will be required. Typically, this will reduce the search time for specific information by approximately an order of magnitude.

OPTIONS

-n N: Output at most N HTTP descriptors.
-v: Print the version and copyright banner of the program.

WARNINGS

In the interest of performance, memory is allocated to hold the entire file to be searched. Large files may create resource issues.
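One way to spot files that might cause such issues is to list the large ones before searching. This is a sketch only: the function name list_large_files is hypothetical, and what counts as "large" depends on the memory available on the host, so the threshold is left to the caller.

```shell
#!/bin/sh
# list_large_files DIR BYTES  (hypothetical helper)
# Print the regular files under DIR whose size exceeds BYTES bytes.
# Uses the POSIX "c" (bytes) unit of find's -size primary.
list_large_files() {
    find "$1" -type f -size +"$2"c -print
}
```

For example, list_large_files www 1048576 lists the files under www that are larger than 1 MiB; the 1 MiB figure is purely illustrative.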

The "not" boolean operator, '!', can NOT be used to find the list of documents that do NOT contain a keyword or phrase (unless it is used in conjunction with a preceding boolean construct that syntactically defines an intermediate accept criteria for the documents). The rationale is that the relevance of a set of documents that do NOT contain a phrase or keyword is ambiguous and has no meaning; i.e., how can documents be ordered that do not contain something? Whether this is a bug, or not, depends on one's point of view.

DIAGNOSTICS

Error messages are issued for illegal or incompatible search patterns; for non-regular, missing, or inaccessible files and directories; for (unlikely) memory allocation failures; and for signal errors.

AUTHORS

A license is hereby granted to reproduce this software source code and to create executable versions from this source code for personal, non-commercial use. The copyright notice included with the software must be maintained in all copies produced.

THIS PROGRAM IS PROVIDED "AS IS". THE AUTHOR PROVIDES NO WARRANTIES WHATSOEVER, EXPRESSED OR IMPLIED, INCLUDING WARRANTIES OF MERCHANTABILITY, TITLE, OR FITNESS FOR ANY PARTICULAR PURPOSE. THE AUTHOR DOES NOT WARRANT THAT USE OF THIS PROGRAM DOES NOT INFRINGE THE INTELLECTUAL PROPERTY RIGHTS OF ANY THIRD PARTY IN ANY COUNTRY.

So there.

Comments and/or bug reports should be addressed to:

john@email.johncon.com

http://www.johncon.com/

http://www.johncon.com/ntropix/

http://www.johncon.com/ndustrix/

http://www.johncon.com/nformatix/

http://www.johncon.com/ndex/