An Example of html_analyzer execution


EXAMPLE:

Below is a portion of HTML text extracted from a fictional database. A rudimentary understanding of HTML is assumed. For more information on HTML, connect by anonymous ftp to info.cern.ch and look in /pub/www/doc for the latest files on HTML and related papers. Another good source of information on HTML is available by anonymous ftp from ftp.ncsa.uiuc.edu in the /Web/mosaic-papers directory.

Beginning of example document

Guide Demo V.1

List of informative files to view:

<A NAME=1 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html"> Description of this demo </A>

<A NAME=2 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html"> More Info Frame </A>

<A NAME=3 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html"> Free Text Frame </A>

<A NAME=4 HREF="http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html"> Free Text Frame </A>

The first link contains a description of this demo. The second link points to a frame that provides more information. The third link has a different tag than the second link, but points to the same place. The fourth tag is the same as the third tag, but points to a different file.

End of example document

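NOTE: for clarity, the following terms refer to the above constructs.

ANCHOR: the whole <A ...> construct, including its contents.

LINK: the value of the HREF attribute, for example http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html

CONTENTS: the text enclosed by the anchor, for example "More Info Frame".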

For more information on HTML syntax and semantics, see the file: MarkUp.html located on CERN's WWW server.


EXPLANATION OF EXECUTION:

Here's how the software handles the above text.

First, the program calls extract_links( ) recursively on all files whose names end with "html" to load the skiplist abstract data type (ADT) with information about the hyperlinks. At the same time, a non-HTML version of each file is created and placed in the temporary repository (default: /var/tmp/html_analyzer; to change the path of the repository, give the path as the last command line argument).
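The extraction step can be illustrated with a minimal Python sketch. It is not the program's own code: it assumes Python's standard html.parser in place of whatever parsing extract_links( ) does, a plain list in place of the skiplist ADT, and the hypothetical names LinkExtractor and extract. It returns the (link, contents) pairs together with a tag-free copy of the text.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, contents) pairs and the tag-free text of one document."""

    def __init__(self):
        super().__init__()
        self.links = []        # finished (href, contents) pairs
        self.text_parts = []   # all character data, with the tags removed
        self._href = None      # HREF of the anchor currently open, if any
        self._content = []     # text seen inside the open anchor

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.finish_anchor()   # tolerate an anchor that was never closed
                self._href = href
                self._content = []

    def handle_endtag(self, tag):
        if tag == "a":
            self.finish_anchor()

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._href is not None:
            self._content.append(data)

    def finish_anchor(self):
        if self._href is not None:
            # Collapse runs of whitespace so contents compare reliably.
            contents = " ".join("".join(self._content).split())
            self.links.append((self._href, contents))
            self._href = None


def extract(html_text):
    """Return (links, plain_text) for one HTML document held in a string."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()           # flush any buffered character data
    parser.finish_anchor()   # record an anchor still open at end of file
    return parser.links, "".join(parser.text_parts)
```

Calling extract( ) on the text of each file found under the database directory would yield the data that the later checks operate on.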

Second, the program calls validate( ) if the -val command line argument is not present. validate( ) attempts to confirm the hyperlinks extracted from all the files in the directory hierarchy.
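As a rough illustration of this kind of check, the Python sketch below probes each distinct link once and reports the ones that cannot be retrieved. The names http_probe and check_links are hypothetical, and urllib stands in for the program's own network code, so this shows the idea rather than how validate( ) is actually written.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=10):
    """Return None if the URL can be retrieved, else a short error string."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return None
    except (urllib.error.URLError, OSError, ValueError) as exc:
        return str(exc)


def check_links(links, probe=http_probe):
    """links: iterable of (href, contents, source_file) tuples.

    Returns a list of (href, contents, source_file, error) tuples for
    every link the probe could not reach. Each distinct href is probed
    only once, however many anchors refer to it.
    """
    results = {}
    failures = []
    for href, contents, source in links:
        if href not in results:
            results[href] = probe(href)
        if results[href] is not None:
            failures.append((href, contents, source, results[href]))
    return failures
```

Passing the probe as a parameter keeps the bookkeeping testable without a network connection: a stand-in function that returns canned answers can replace http_probe.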

Third, the program calls completeness( ), which looks for occurrences of each anchor's contents in the database that are not used as a hyperlink. This is implemented by performing a grep on the non-HTML files and listing the files that match the hyperlink's contents.
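A search of this kind might look like the following Python sketch. It replaces grep with an in-memory substring count, and it assumes a case-insensitive match and that a file is reported only when the contents occur more often than they are used as a hyperlink in that file. The name unlinked_mentions and both of those rules are assumptions made for illustration, not details taken from the program.

```python
def normalize(text):
    """Lowercase and collapse whitespace so that matching is forgiving."""
    return " ".join(text.split()).lower()


def unlinked_mentions(links, plain_texts):
    """links: list of (href, contents, source_file) tuples.
    plain_texts: dict mapping file name -> tag-free text of that file.

    Returns {(href, contents): [files where the contents also appear
    as ordinary text]}.
    """
    report = {}
    for href, contents in sorted({(h, c) for h, c, _ in links}):
        needle = normalize(contents)
        if not needle:
            continue
        hits = []
        for name, text in sorted(plain_texts.items()):
            total = normalize(text).count(needle)
            # Anchors in this file that already carry these contents,
            # whichever document they point to.
            linked = sum(1 for _, c, src in links
                         if src == name and normalize(c) == needle)
            if total > linked:
                hits.append(name)
        if hits:
            report[(href, contents)] = hits
    return report
```

Subtracting the anchors already present matters because the tag-free text still contains the contents of every hyperlink, so a plain search would match every file that uses the link at all.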

Finally, the functions consistent_link( ) and consistent_content( ) are called to find hyperlinks that are pointed to by two different contents and hyperlink contents that point to two different documents. Essentially, this looks for hyperlinks that violate a one-to-one correspondence between the contents of the hyperlink and the link itself.
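The one-to-one test can be sketched in a few lines of Python. The helper names below (one_to_many, inconsistent_links, inconsistent_contents) are hypothetical and a dictionary of counters stands in for the skiplist, so this shows the idea rather than the program's implementation.

```python
from collections import Counter, defaultdict


def one_to_many(pairs):
    """pairs: iterable of (key, value).

    Returns the keys seen with more than one distinct value, each mapped
    to a Counter recording how often every value occurred.
    """
    seen = defaultdict(Counter)
    for key, value in pairs:
        seen[key][value] += 1
    return {key: counts for key, counts in seen.items() if len(counts) > 1}


def inconsistent_links(links):
    """Links (hrefs) described by more than one contents string."""
    return one_to_many((href, contents) for href, contents in links)


def inconsistent_contents(links):
    """Contents strings that point to more than one link (href)."""
    return one_to_many((contents, href) for href, contents in links)
```

Given the four (link, contents) pairs of the example document, inconsistent_links( ) reports the More_info.html link with its two contents, and inconsistent_contents( ) reports "Free Text Frame" with its two links. The per-value counts are the occurrence numbers a maintainer would use to decide which variant to keep.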


WALK THROUGH OF EXAMPLE:

In the above example, there is no file /u/CIMS/Demo_Description.html on nsidc1.colorado.edu, which runs an httpd server listening on port 1729. The first series of tests will discover this and notify the user as follows:

    VERIFYING LINKS...
    WWW Alert: HTTP server at nsidc1.colorado.edu:1729 replies:
    HTTP/1.0 500 Unable to access document.
    WWW Alert: Unable to access document.
    WARNING: Failed in checking: http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
    With content of: Description of this demo
    In local file: ./temp/example.html

Next, html_analyzer finds that the contents used to describe this link occur elsewhere without a link. This could have been in another file, but in this case the string occurs in the text portion of the same document. The user is given a list of the file(s) in which that text needs to be made into a hyperlink. The output will look something like this:

    VERIFYING COMPLETENESS...
    WARNING: These filenames contain the content: Description of this demo
    Without a link to: http://nsidc1.colorado.edu:1729/u/CIMS/Demo_Description.html
    example.html

Next, the user will be informed that more than one hyperlink content is used to describe the link to /u/CIMS/More_info.html on nsidc1. In this case, both "More Info Frame" and "Free Text Frame" point to the same file. One of them needs to go. To help the HTML database maintainers decide which content to remove, the software informs the user of the number of occurrences of each in the database. The output would look something like this:

    VERIFYING CONSISTENCY OF LINKS...
    WARNING: Link used inconsistently.
    HREF: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
    occurs 1 time with content: Free Text Frame
    as in file: ./temp/example.html,
    but also occurs 1 time with content: More Info Frame
    as in file: ./temp/example.html

Next, the user will be informed that the hyperlink that contains "Free Text Frame" points to more than one file. This is easily corrected by changing the contents of the hyperlink that is used least often to another name. The output would look like this:

    VERIFYING CONSISTENCY OF CONTENTS...
    WARNING: Content used inconsistently.
    CONTENT: Free Text Frame
    occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/Even_more_info.html
    as in file: ./temp/example.html,
    but also occurs 1 time with href: http://nsidc1.colorado.edu:1729/u/CIMS/More_info.html
    as in file: ./temp/example.html


Hopefully, this walk-through has made clear how html_analyzer works. If not, please feel free to contact me.

James E. Pitkow        
Graphics, Visualization and Usability Laboratory
Georgia Institute of Technology
[email protected]