Writing HTML files without a validating parser is like trying to write computer programs without a compiler: don't do it! Fortunately, help is readily available on the Internet.
James Clark <[email protected]> is developing a new implementation of a suite of SGML parser tools, called SP. These include:
nsgmls: an sgmls-compatible validating SGML parser.
spam
sgmlnorm
Besides being a complete redesign of the earlier successful sgmls implementation, the new programs are designed for the future: they support extended character sets, such as Unicode, and various multi-byte encodings used in East Asian languages.
The new code is written almost entirely in C++ (about 45K lines at version 0.4, or twice the size of Don Knuth's TeX or Metafont), and requires template support, a relatively new feature of C++ which is not yet widely available.
WARNING: To build these programs, you will need about 50MB of disk space, unless you remove the default -g compiler option. Doing so reduces the executable sizes from almost 10MB each to about 1MB (on a Sun SPARC Solaris 2.3 system). Alternatively, you can build them, then run the UNIX strip command on the executables to remove debug symbols.
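The stripping step mentioned in the warning can be wrapped in a small helper. This is a minimal sketch, not part of the SP distribution: the function name strip_if_present and the STRIP override variable are hypothetical, and the paths of the built executables are left to the caller because they depend on the layout of the build tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of SP): strip debug symbols from each
# named executable, skipping names that do not exist.
# STRIP may be set to another command, for example for a dry run.
strip_if_present() {
    for exe in "$@"; do
        if [ -f "$exe" ]; then
            "${STRIP:-strip}" "$exe" || return 1
            echo "stripped: $exe"
        else
            echo "skipped (not found): $exe" >&2
        fi
    done
}
```

It would be called with the paths of the freshly built programs as arguments, after which ls -l should show the much smaller executables.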
The SP programs can be compiled and built using recent releases of GNU g++ (2.6.1 or later). g++ itself is built as part of the GNU gcc compiler installation; although that installation takes a few hours, and requires about 120MB of disk space to be able to run the validation tests before installation, it is straightforward, and should be problem free on most current UNIX systems. The GNU compiler suite has also been built on IBM PC MS-DOS and DEC OpenVMS systems, although those versions usually lag behind.
The SP distribution site has binaries for Sun SunOS and Solaris, so if you have such a system, you may not need to build any of the SP code from scratch, or to install g++.
Just as with sgmls, lengthy command lines are needed to run these programs successfully. To facilitate their use, I've prepared simple UNIX shell scripts html-ncheck and html-spam to hide the complexity, so that only the HTML files need to be provided on the script command lines.
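The html-ncheck and html-spam scripts themselves are not reproduced here. Purely as an illustration, a wrapper of this kind might look like the following sketch. The function name html_ncheck_sketch, the NSGMLS and HTML_CATALOG variables, and the use of the SGML_CATALOG_FILES environment variable to locate the catalog are all assumptions, not details taken from those scripts; -s is the nsgmls option that suppresses normal output so that only error messages appear.

```shell
#!/bin/sh
# Sketch of a validation wrapper (hypothetical; not the real html-ncheck).
# Assumptions: nsgmls is on PATH, and it finds the catalog through the
# SGML_CATALOG_FILES environment variable.
HTML_CATALOG=${HTML_CATALOG:-/usr/local/lib/html-check/lib/catalog}

html_ncheck_sketch() {
    status=0
    for file in "$@"; do
        echo "==> $file"
        # -s suppresses the parser's normal output; error messages
        # are still reported.
        SGML_CATALOG_FILES=$HTML_CATALOG "${NSGMLS:-nsgmls}" -s "$file" || status=1
    done
    return $status
}
```

With a wrapper like this, only the HTML file names appear on the command line, and the exit status is nonzero if any file fails to validate.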
If you have installed the html-check distribution, and you want to use html-spam, you need to add these lines to the end of the HTML catalog file, /usr/local/lib/html-check/lib/catalog:
    -- Added at the suggestion of James Clark <[email protected]> --
    -- so that spam -p doesn't output the contents of html.decl --
    SGMLDECL html.decl
Without this change, the contents of html.decl are copied to the output if the -p option is included in the spam invocation in html-spam; omitting -p and including html.decl doesn't help, because the <!DOCTYPE ...> line is then lost.
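The catalog change can be applied by hand with an editor. The following is a minimal sketch of one way to script it so that it is safe to run more than once; the function name add_sgmldecl is hypothetical, and the comment line written to the catalog is shortened from the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: append the SGMLDECL entry to a catalog file,
# unless an SGMLDECL entry is already there.
add_sgmldecl() {
    catalog=$1
    if [ ! -f "$catalog" ]; then
        echo "no such catalog file: $catalog" >&2
        return 1
    fi
    if grep -q '^SGMLDECL[[:space:]]' "$catalog"; then
        echo "SGMLDECL entry already present in $catalog"
        return 0
    fi
    # Quoted here-document delimiter: the text is appended literally.
    cat >> "$catalog" <<'EOF'
-- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl
EOF
    echo "SGMLDECL entry appended to $catalog"
}
```

For the installation described here, the argument would be /usr/local/lib/html-check/lib/catalog.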
I have successfully built sp-0.4 with g++ (gcc 2.6.3) on Sun Solaris 2.3 and 2.4 without any problems at all.
Build attempts on other architectures have been less successful, though with hand tweaking, most have been made to succeed. My findings are summarized in the remaining sections.
An initial build attempt on the DEC Alpha OSF/1 system failed with missing externals at link time. Correspondence with the author of SP led me to completely remove the installed libraries and include files for gcc and libg++, and then do a fresh rebuild of both packages. I have gone through several generations of the GNU C compiler and C++ library on this system without having had to do this in the past. In any event, this completely fixed the problems, and the SP build completed successfully with no modifications whatever to the Makefile or source code, and the make check validation tests were all successful.
An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system, and then reinstalled. The Makefile was modified to set RANLIB=ranlib, and the build of SP then completed successfully.
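Instead of editing the Makefile, the same setting can often be supplied on the make command line, since variable assignments given there take precedence over those in the Makefile. The sketch below assumes that the setting also reaches any sub-makes that the SP Makefile invokes, which depends on the make in use; the function name build_sp and the MAKE override variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: build SP, then run its validation tests, with
# RANLIB overridden on the command line. Extra arguments (for example
# CC=gcc CXX=g++) are passed through to make unchanged.
build_sp() {
    "${MAKE:-make}" RANLIB=ranlib "$@" &&
    "${MAKE:-make}" RANLIB=ranlib "$@" check
}
```

The validation tests run only if the build itself succeeds.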
An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system. I was able to build gcc 2.6.3 successfully, and the stage1 and stage2 validation tests passed successfully. However, the build of libg++ 2.6.2 failed because of bad code (an undefined branch target symbol) generated by gcc for libstdc++/cMAIN.s. Similar errors showed up in SP 0.4 when I tried to use g++ anyway.
I then searched the gnu.g++ newsgroup for problem reports on this architecture, and found that a patched Hewlett-Packard GNU tools distribution is available locally on ftp://jaguar.cs.utah.edu/dist. Although I know some of the people involved in that distribution, it has lagged behind the FSF releases in the past, so I had not thought to look there again.
I therefore fetched the hpuxbin.tar.Z distribution file, unbundled it, and installed it on our HP systems. Fortunately, this solved all of the problems.
Although the distribution includes the installed libg++ 2.6.2 files, I was curious to see whether the patched compiler would build libg++ properly. It does, although I had to add a -lm library switch to the libg++/gperf/src/Makefile.
SP 0.4 now builds completely correctly on this system with the patched gcc 2.6.3, and the make check validation tests all pass.
I also tried to build SP with the native HP-UX C++ compiler. I made these changes to the top-level Makefile:
    CXX=CC
    WARN=
    DEBUG=
    DEFINES=-DANSI_CLASS_INST $(XDEFINES)
    CXXFLAGS=$(DEBUG) $(WARN)
    CC=c89
Compilation failed at parser/parseDecl.C:
    % CC -I. -I./../lib -DANSI_CLASS_INST -c parseDecl.C
    CC: "parseDecl.C", line 1497: internal error: Extract/deposit of 0 for field size in EXTRS not in [1..32] (7103)
and in many subsequent files with messages of the type
    % CC -I. -I./../lib -I./../parser -I./../em -I./../app -I./../api \
         -DANSI_CLASS_INST -c sgmlnorm.C
    CC: "NormMessages.H", line 13: error: domain_ undefined (1285)
    CC: "NormMessages.H", line 13: error: no match for call: const MessageType0* -> MessageType0:: MessageType0( const enum MessageType::Severity, any , char [48]) (1735)
    CC: "NormMessages.H", line 13: choice of MessageType0::MessageType0()s:
    CC: "NormMessages.H", line 13: MessageType0::MessageType0(MessageType::Severity, const char *, const char *);
    CC: "NormMessages.H", line 13: MessageType0::MessageType0(const struct MessageType0&);
    CC: "NormMessages.H", line 13: (no usable const member function)
    ...
I gave up at this point on the HP-UX native C++ compiler installation attempt.
SP 0.4 is not yet installable on this system.
A few source code patches were required to overcome errors in the IBM C++ header files, and restrictions of the IBM make utility; they are available as a separate file.
gcc 2.6.3 produces compilation errors in em/em_inst.C: the compiler generates branches in the assembly code to undefined symbols. This same error was seen in the Hewlett-Packard version of gcc 2.6.3, but has been fixed in the patched version noted above.
At the suggestion of Joachim Schrod, I removed the -O compiler option, and recompiled the source code: this eliminated the undefined symbol problem.
I used make -i to force the build past the compilation errors. However, this resulted in massive numbers of undefined symbols at link time, even though the symbols are in fact defined in the SP libraries.
Compilation with the IBM native xlC compiler was successful, but the same kinds of link errors appeared as with the g++ build. I suspect that an IBM linker error is the culprit: judging from the error messages produced, the linker does not correctly handle the mangling of C++ names containing templates.
An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system. I was able to build and install gcc 2.6.3 successfully, and the stage1 and stage2 validation tests passed successfully. However, the build of libg++ 2.6.2 initially failed because of some minor problems; I was able to prepare workarounds for them, and then complete the build and installation. A report of the problems and their fixes has been sent to the libg++ developers. If you require the patches for this system, send e-mail to [email protected].
I modified the top-level SP Makefile to set RANLIB=ranlib. The build of SP then completed successfully, and make check passed all of the validation tests.
An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system, and then reinstalled. The build of SP then completed successfully with make CC=gcc CXX=g++, and make check passed all of the validation tests.
I also tried a build using the vendor-provided C++ compiler, but it failed utterly because of the lack of template support in the compiler.
The trick to a successful build of SP on this system is to make the simple patch in lib/assert.C described in the Silicon Graphics section above, and to make these Makefile changes:
    LIBS=-liostream -lg++ $(XLIBS)
    RANLIB=ranlib
Once these are done, the build will complete successfully, and so will the validation suite: make check.