The html_analyzer-0.30 README file
OVERVIEW:
This file contains information outlining the types of processing
performed by the html_analyzer software as well as copyright, disclaimer,
and funding information. Please read the file
Installation in this directory for information on installing the
software. To walk through an example run of the analyzer, see
Example.
AVAILABILITY:
The software is currently being distributed via anonymous ftp from:
ftp.ncsa.uiuc.edu in the /Mosaic/misc directory in
compress'd and gzip'd forms. It is also available via anonymous
ftp from: ftp.gvu.cc.gatech.edu in pub/gvu/www/pitkow/html_analyzer.
MOTIVATION:
The intent of the html_analyzer is to assist in the maintenance of
HyperText MarkUp Language (HTML) databases. As the number of HTML
databases increases, the potential for hyperlinks that point to files or
servers that no longer exist also increases. This results in the need for
an automated hyperlink validation program. This is exactly what the
html_analyzer does. The program also explores the relationship between
hyperlinks and the contents of the hyperlink.
PROCESSING:
This directory contains the software to perform analysis of HTML
databases. Specifically, the following tasks are performed:
- Extract all hyperlinks (a.k.a. anchors) from all *.html files within
a given directory hierarchy. The HREF values are allowed to be either quoted
or not. The following types of hyperlinks are not processed:
- HREF=""
- HREF=" "
- HREF="#foo"
- HREF=#foo
- HREF="telnet"
- HREF=telnet
- HREF="rlogin" and
- HREF=rlogin
Note: Within-document hyperlinks are pointless to verify; either the
hyperlink goes to the intended section or it does not. The telnet and
rlogin hyperlinks only require that the intended machine is alive. If the
machine is alive, the user must proceed to enter information. Since this
user interaction defeats the automated goal of the software, these access
methods are not processed.
- Create non-html versions of the files. These files are by default
placed in /var/tmp/html_analyzer. These files are used to examine the
relationship between hyperlinks and the contents of the hyperlinks. To
change the location of this repository, place the desired directory as the
last command line argument, e.g.
shell_prompt> html_analyzer . /users/pitkow/swap
Creates the directory /html_analyzer in /users/pitkow/swap and places
the non-html files there.
Note: The path must already exist for execution to succeed. The
html_analyzer creates a directory within this path; it does not create the
path itself.
- Validate the availability of the documents pointed to by the hyperlinks.
This test is called validate.
This is accomplished via routines from Mosaic's modified WWWLibrary2.
- Look for hyperlink contents that occur in the database but are not
themselves hyperlinks (See Example). This
test is termed completeness.
- Look for a one-to-one relation between hyperlinks and the contents of
the hyperlink (See Example). This test
is called consistency.
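
The extraction step in the first item above could be sketched roughly as
follows. This is an illustration in Python, not the analyzer's own C code:
the names HREF_RE, SKIPPED_SCHEMES, and extract_hrefs are hypothetical, and
treating telnet and rlogin as URL schemes is an assumption about how the
skipped forms are recognized.

```python
import re

# Matches HREF="value" or HREF=value (unquoted), case-insensitively.
HREF_RE = re.compile(r'href\s*=\s*(?:"([^"]*)"|([^\s">]+))', re.IGNORECASE)

# Access methods that need user interaction and are therefore skipped
# (assumption: they appear as the scheme part of the HREF value).
SKIPPED_SCHEMES = ("telnet", "rlogin")


def extract_hrefs(html):
    """Return the HREF values worth checking, in document order."""
    kept = []
    for quoted, bare in HREF_RE.findall(html):
        value = (quoted or bare).strip()
        if not value:                      # HREF="" or HREF=" "
            continue
        if value.startswith("#"):          # within-document hyperlink
            continue
        scheme = value.split(":", 1)[0].lower()
        if scheme in SKIPPED_SCHEMES:      # telnet or rlogin
            continue
        kept.append(value)
    return kept
```

A regular expression is used here only to keep the sketch short; a real
HTML parser would cope better with unusual markup.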
RATIONALE:
We believe that there ought to exist a one-to-one correspondence between
hyperlinks and the hyperlink's contents, such that every occurrence of the
hyperlink points to only one document (or section of a document). This means
every time a user sees a hyperlink, it will always point to the same section
of a document. It also means that each section of a document will only have one
hyperlink pointing to it. We hypothesize that such a correspondence is
necessary to create a clear internal representation in the user of the
connections in the HTML database.
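
As a rough illustration of what checking this one-to-one correspondence
involves, the sketch below groups (hyperlink contents, target) pairs in both
directions and reports every violation. It is written in Python and is not
the analyzer's skiplist-based C code; the function name find_inconsistencies
and the whitespace and case normalization are assumptions made for the
example.

```python
from collections import defaultdict


def find_inconsistencies(links):
    """links: iterable of (anchor_text, target) pairs from the database.

    Returns two dicts: anchor texts that point to more than one target,
    and targets that are reached through more than one anchor text.
    """
    targets_by_text = defaultdict(set)
    texts_by_target = defaultdict(set)
    for text, target in links:
        # Assumption: contents are compared ignoring case and extra spaces.
        key = " ".join(text.split()).lower()
        targets_by_text[key].add(target)
        texts_by_target[target].add(key)
    one_to_many = {t: sorted(v) for t, v in targets_by_text.items() if len(v) > 1}
    many_to_one = {t: sorted(v) for t, v in texts_by_target.items() if len(v) > 1}
    return one_to_many, many_to_one
```

A database is consistent in the sense described above only when both
returned dicts are empty.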
RUNNING:
To run the html_analyzer after it has been installed (Please read the file
Installation in this directory for
information on installing the software), type:
html_analyzer [-val] [-com] [-con] directory [path of repository]
The -val, -com, and -con flags turn off the validation, completeness, and
consistency tests, respectively. Only a directory name can be given as the
target to check; all *.html files within that directory hierarchy will be
processed. The path of a temporary repository (default is /var/tmp) can be
given if /var/tmp is full or otherwise unsuitable. A directory (/html_analyzer)
is created in the repository to store the temporary files generated by execution.
The program does not create the temporary repository itself, so it must already
exist.
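
To make the argument order concrete, here is a small Python sketch of how
the usage line above could be interpreted. It is not taken from the
html_analyzer source: parse_args and DEFAULT_REPOSITORY are hypothetical
names, and the handling of a wrong number of arguments is an assumption.

```python
import posixpath

DEFAULT_REPOSITORY = "/var/tmp"
USAGE = "usage: html_analyzer [-val] [-com] [-con] directory [path of repository]"


def parse_args(argv):
    """Interpret arguments in the order given by the usage line."""
    tests = {"validate": True, "completeness": True, "consistency": True}
    flags = {"-val": "validate", "-com": "completeness", "-con": "consistency"}
    positional = []
    for arg in argv:
        if arg in flags:
            tests[flags[arg]] = False      # each flag turns its test off
        else:
            positional.append(arg)
    if not 1 <= len(positional) <= 2:
        raise SystemExit(USAGE)            # assumed behaviour on bad input
    directory = positional[0]
    repository = positional[1] if len(positional) == 2 else DEFAULT_REPOSITORY
    # Temporary files go in an html_analyzer directory inside the repository.
    work_dir = posixpath.join(repository, "html_analyzer")
    return tests, directory, work_dir
```

For example, parse_args(["-val", ".", "/users/pitkow/swap"]) disables only
the validation test and places the temporary files in
/users/pitkow/swap/html_analyzer.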
COPYRIGHT:
The libwww2 directory is the modified WWW library that
accompanies xmosaic-pre4. The libhtmlw directory is also from the prerelease.
Mosaic was developed by Marc Andreessen at the National Center for
Supercomputing Applications. This code is available from ftp.ncsa.uiuc.edu
in the /Web directory. The original WWWLibrary2 library was developed
by Tim Berners-Lee at the European Laboratory for Particle Physics (CERN).
This code is available from ftp.info.ch in the /pub/www/src directory.
Please see the file Copyrights in this directory for more information on
the copyrights that apply to these portions of code.
The Regents of the University of Colorado claim copyright on the
other portions of the distribution.
This distribution of the software may be freely distributed, used,
and modified but may not be sold as a whole or in part without permission
of the copyright owners of the parts.
DISCLAIMER:
This software is provided as is. The Laboratory for Atmospheric
and Space Physics (LASP) and the author are not responsible for support
of this distribution.
FUNDING:
Development of this software was funded by the NASA Earth Observing
System Project under NASA contract NAS5-32392.
CHANGES:
Version 0.30 from 0.10
- 0) made compliant with current HTML as implemented by NCSA
- 1) removed memory leak
- 2) removed MOTIF dependencies
Version 0.10 from 0.02
- 0) made dependent on Mosaic's libwww2 and libhtmlw; this means that all
valid Mosaic files are now valid html_analyzer files.
- 1) removed unnecessary temporary files created by extract_links();
extract_links() now loads the skiplists directly.
- 2) enabled validation of other access methods, e.g. gopher, wais, etc.
Version 0.02 from 0.01
- 0) converted CHECK_HTML_DB and GET_ANCHORS to C code.
- 1) added verification of relative addressed hyperlinks.
- 2) added one-to-many check of the hyperlink's contents to documents
pointed to (previously: many-to-one check of hyperlinks to the hyperlink's
contents)
- 3) cleaned up
ENHANCEMENTS:
Here's a list of things that could be done to improve the html_analyzer:
- 0) create a program to automatically prune hyperlinks that no longer point
to valid files. This entails some tricky questions as to how automated this
process needs to be. In other words, it might be nice for the user to
have the option of specifying the correct location of the file and have the
software make the changes to the HREFs as needed AS WELL as provide the
user with the option of having the software remove all anchors pointing to
the no-longer existent file. Let me know if you're interested in this option;
it seems like the next logical addition to the software.
- 1) add a linked list to the data struct of the skiplist that points to a
list of other files that have the same hyperlink and hyperlink content. This
will enable more sophisticated analysis, e.g. enable option 0) above by producing
a list of files that point to a document for pruning purposes, etc.
- 2) add statistical analysis of the HTML db, i.e. number of hyperlinks per
document, number of links to a document, list of files that point to a document,
etc.
- 3) perform an empirical study to confirm the hypothesis of the importance of a
one-to-one correspondence between hyperlinks and their content. [I might do this
this fall if time allows].
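
The statistics suggested in item 2) could be gathered along these lines.
This Python sketch only illustrates the idea and is not part of the
distribution; link_statistics and its input format, a list of
(source file, target) pairs, are assumptions.

```python
from collections import Counter, defaultdict


def link_statistics(links):
    """links: iterable of (source_file, target) pairs.

    Returns the number of hyperlinks per source document, the number of
    links to each target, and the files that point to each target.
    """
    links_per_document = Counter()
    links_to_document = Counter()
    pointing_files = defaultdict(set)
    for source, target in links:
        links_per_document[source] += 1
        links_to_document[target] += 1
        pointing_files[target].add(source)
    files_by_target = {t: sorted(s) for t, s in pointing_files.items()}
    return links_per_document, links_to_document, files_by_target
```

The third result is the list of files pointing to a document, which is
also what the pruning idea in item 0) would need.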
COMMENTS:
The purpose of this distribution is to further the development of
HTML database creation and maintenance utilities. Comments, questions,
and REVISIONS are indeed welcome.
To be added to the html_analyzer mailing list, mail
[email protected] with the subject: html_analyzer add
James E. Pitkow
Graphics, Visualization and Usability Laboratory
Georgia Institute of Technology
[email protected]