WWW http://www.math.utah.edu/~beebe

Notes about SP version 0.4

Introduction

Writing HTML files without a validating parser is like trying to write computer programs without a compiler: don't do it! Fortunately, help is readily available on the Internet.

James Clark <[email protected]> is developing a new implementation of a suite of SGML parser tools, called SP. These include:

nsgmls
-- an sgmls-compatible validating SGML parser.
spam
-- an SGML tag normalizer. This program is capable of supplying missing end tags, expanding abbreviated tags, and also of filling in all known attributes in every begin tag. The latter can be handy if you want to find what attributes are available, without having to read the HTML grammar.
sgmlnorm
-- a simple SGML tag normalizer (as yet undocumented).

Besides being a complete redesign of the earlier successful sgmls implementation, the new programs are designed for the future: they support extended character sets, such as Unicode, and various multi-byte encodings used for East Asian languages.

Some details

The new code is written almost entirely in C++ (about 45K lines at version 0.4, or twice the size of Don Knuth's TeX or Metafont), and requires template support, a relatively new feature of C++ which is not yet widely available.

WARNING: To build these programs, you will need about 50MB of disk space, unless you remove the default -g compiler option. Doing so reduces the executable sizes from almost 10MB each to about 1MB (on a Sun SPARC Solaris 2.3 system). Alternatively, you can build them, then run the UNIX strip command on the executables to remove debug symbols.
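
As a rough illustration of the second approach, the sketch below strips a list of executables and reports their sizes before and after. The function name strip_and_report and the STRIP override variable are hypothetical conveniences, not part of the SP distribution.

```shell
#!/bin/sh
# Sketch: strip debug symbols from executables and report the size change.
# strip_and_report and the STRIP variable are illustrative names only.
strip_and_report() {
    for exe in "$@"; do
        # $(( ... )) normalizes any padding that wc may add to its output
        before=$(( $(wc -c < "$exe") ))
        ${STRIP:-strip} "$exe" || return 1
        after=$(( $(wc -c < "$exe") ))
        echo "$exe: $before -> $after bytes"
    done
}
```

It would be called with the paths of the built programs, wherever your build tree leaves them. If the Makefile's DEBUG variable is what carries -g (an assumption based on the Makefile settings shown later in these notes), running make DEBUG= should avoid generating the symbols in the first place.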

The SP programs can be compiled and built using recent releases of GNU g++ (2.6.1 or later). g++ itself is built as part of the GNU gcc compiler installation; although that installation takes a few hours, and requires about 120MB of disk space to be able to run the validation tests before installation, it is straightforward, and should be problem free on most current UNIX systems. The GNU compiler suite has also been built on IBM PC MS DOS and DEC OpenVMS systems, although those versions usually lag behind.

The SP distribution site has binaries for Sun SunOS and Solaris, so if you have such a system, you may not need to build any of the SP code from scratch, or to install g++.

Just as with sgmls, lengthy command lines are needed to run these programs successfully. To facilitate their use, I've prepared simple UNIX shell scripts html-ncheck and html-spam to hide the complexity, so that only the HTML files need to be provided on the script command lines.
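
To give an idea of what such a wrapper does, here is a minimal sketch. It is not the actual html-ncheck script. The function name, the NSGMLS and HTML_CATALOG variables, and the assumption that the parser accepts -s (suppress normal output) and -m (catalog file) are all mine; check the documentation for your SP version before relying on those options.

```shell
#!/bin/sh
# Sketch of a wrapper in the spirit of html-ncheck (NOT the real script).
# Assumed, not confirmed by these notes: the -s and -m options, and the
# NSGMLS / HTML_CATALOG override variables, which are hypothetical names.
html_ncheck_sketch() {
    parser=${NSGMLS:-nsgmls}
    catalog=${HTML_CATALOG:-/usr/local/lib/html-check/lib/catalog}
    status=0
    for file in "$@"; do
        # Run the parser once per file; remember whether any run failed
        "$parser" -s -m "$catalog" "$file" || status=1
    done
    return $status
}
```

With this in place, only the HTML file names need to be given, for example html_ncheck_sketch index.html notes.html, and the return status is nonzero if any file fails to validate.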

If you have installed the html-check distribution and want to use html-spam, you need to add these lines to the end of the HTML catalog file, /usr/local/lib/html-check/lib/catalog:

        -- Added at the suggestion of James Clark <[email protected]> --
        -- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl

Without this change, the contents of html.decl are copied to the output if the -p option is included in the spam invocation in html-spam; omitting -p and including html.decl doesn't help, because the <!DOCTYPE ... > line is then lost.
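
The catalog change can of course be made in a text editor. For repeated installations, a small script along the following lines might be convenient. It is only a sketch: add_sgmldecl is a hypothetical helper name, and it appends just the SGMLDECL entry, without the explanatory comment lines.

```shell
#!/bin/sh
# Sketch: append "SGMLDECL html.decl" to an SGML catalog file, but only
# if the file has no SGMLDECL entry yet. add_sgmldecl is a hypothetical
# helper, not part of html-check or SP.
add_sgmldecl() {
    catalog=$1
    if [ ! -f "$catalog" ]; then
        echo "add_sgmldecl: no such catalog: $catalog" >&2
        return 1
    fi
    # Leave the file alone if any SGMLDECL entry is already present
    if grep -q '^SGMLDECL[[:space:]]' "$catalog"; then
        return 0
    fi
    printf '%s\n' 'SGMLDECL html.decl' >> "$catalog"
}
```

A typical call would be add_sgmldecl /usr/local/lib/html-check/lib/catalog; running it a second time leaves the file unchanged.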

Installation report

I have successfully built sp-0.4 with g++ (gcc 2.6.3) on Sun Solaris 2.3 and 2.4 without any problems at all. Build attempts on other architectures have been less successful, though with hand tweaking, most have been made to succeed. My findings are summarized in the remaining sections.

DEC Alpha 3000/300LX OSF/1 3.0

An initial build attempt failed on this system with missing externals at link time.

Correspondence with the author of SP led me to completely remove the installed libraries and include files for gcc and libg++, and then do a fresh rebuild of both systems. I have gone through several generations of the GNU C compiler and C++ library on this system, without having to do this in the past. In any event, this completely fixed the problems, and the SP build completed successfully with no modifications whatever to the Makefile or source code, and the make check validation tests were all successful.

DECstation 3100, 5000 ULTRIX 4.3

An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system, and then reinstalled. The Makefile was modified to set RANLIB=ranlib, and the build of SP then completed successfully.

Hewlett-Packard HP-9000/735 HP-UX 9.0.3

An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system. I was able to build gcc 2.6.3 successfully, and the stage1 and stage2 validation tests passed successfully. However, the build of libg++ 2.6.2 failed because of bad code (an undefined branch target symbol) generated by gcc for libstdc++/cMAIN.s. Similar errors showed up in SP 0.4 when I tried to use g++ anyway.

I then searched the gnu.g++ newsgroup for problem reports on this architecture, and found that a patched Hewlett-Packard GNU tools distribution is available locally on ftp://jaguar.cs.utah.edu/dist. While I know some of the people involved in that distribution, in the past, it has lagged the FSF releases, so I had not thought to look there again.

I therefore fetched the hpuxbin.tar.Z distribution file, unbundled it, and installed it on our HP systems. Fortunately, this solved all of the problems. Although the distribution includes the installed libg++ 2.6.2 files, I was curious to see whether the patched compiler would build libg++ properly. It does, although I had to add a -lm library switch to the libg++/gperf/src/Makefile.

SP 0.4 now builds completely correctly on this system with the patched gcc 2.6.3, and the make check validation tests all pass.

I also tried to build SP with the native HP-UX C++ compiler. I made these changes to the top-level Makefile:

CXX=CC
WARN=
DEBUG=
DEFINES=-DANSI_CLASS_INST $(XDEFINES)
CXXFLAGS=$(DEBUG) $(WARN)
CC=c89

Compilation failed at parser/parseDecl.C:

% CC  -I. -I./../lib -DANSI_CLASS_INST   -c parseDecl.C
CC: "parseDecl.C", line 1497: internal error: Extract/deposit of 0
     for field size in EXTRS  not in [1..32] (7103)

and in many subsequent files with messages of the type

% CC -I. -I./../lib -I./../parser -I./../em -I./../app -I./../api \
-DANSI_CLASS_INST -c sgmlnorm.C
CC: "NormMessages.H", line 13: error:  domain_ undefined (1285)
CC: "NormMessages.H", line 13: error: no match for call: const
MessageType0* -> MessageType0:: MessageType0( const enum
MessageType::Severity, any , char [48]) (1735)
CC: "NormMessages.H", line 13: choice of
MessageType0::MessageType0()s:
CC: "NormMessages.H", line 13:
MessageType0::MessageType0(MessageType::Severity, const char *, const
char *);
CC: "NormMessages.H", line 13: MessageType0::MessageType0(const
struct MessageType0&);
CC: "NormMessages.H", line 13: (no usable const member function)
...

I gave up at this point on the HP-UX native C++ compiler installation attempt.

IBM RS/6000 AIX 3.2.5

SP 0.4 is not yet installable on this system.

A few source code patches were required to overcome errors in the IBM C++ header files, and restrictions of the IBM make utility; they are available as a separate file.

gcc 2.6.3 produces compilation errors in em/em_inst.C: the compiler generates branches in the assembly code to undefined symbols. This same error was seen in the Hewlett-Packard version of gcc 2.6.3, but has been fixed in the patched version noted above.

At the suggestion of Joachim Schrod, I removed the -O compiler option, and recompiled the source code: this eliminated the undefined symbol problem.

I used make -i to force the build past the compilation errors. However, this resulted in massive numbers of undefined symbols at link time, when the symbols are in fact defined in the SP libraries.

Compilation with the IBM native xlC compiler was successful, but the same kinds of link errors appeared as happened with the g++ build. I suspect that an IBM linker error is the culprit: judging from the error messages produced, the linker does not correctly handle the mangling of C++ names containing templates.

NeXT Turbostation Mach 3.0

An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system. I was able to build and install gcc 2.6.3 successfully, and the stage1 and stage2 validation tests passed successfully.

However, the build of libg++ 2.6.2 initially failed because of some minor problems, for which I was able to prepare workarounds and then complete the build and installation. A report of the problems and their fixes has been sent to the libg++ developers. If you require the patches for this system, send e-mail to [email protected].

I modified the top-level SP Makefile to set RANLIB=ranlib. The build of SP then completed successfully, and make check passed all of the validation tests.

Silicon Graphics Indigo-2 IRIX 5.3

An initial build attempt failed on this system, so just as on the DEC Alpha OSF/1 system, gcc and libg++ were completely removed from the system, and then reinstalled. The build of SP then completed successfully with make CC=gcc CXX=g++, and make check passed all of the validation tests.

I also tried a build using the vendor-provided C++ compiler, but it failed utterly because of the lack of template support in the compiler.

Sun SPARC SunOS 4.1.3 with gcc 2.6.2

The trick to a successful build of SP on this system is to make the simple patch in lib/assert.C described in the Silicon Graphics section above, and to make these Makefile changes:

LIBS=-liostream -lg++ $(XLIBS)
RANLIB=ranlib

Once these are done, the build will complete successfully, and so will the validation suite: make check.
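
Because macros given on the make command line take precedence over assignments inside a Makefile, the two settings above can also be supplied without editing the file. The sketch below wraps that invocation; build_sp_sunos and the MAKE override variable are hypothetical names, and the lib/assert.C patch still has to be applied by hand first.

```shell
#!/bin/sh
# Sketch: pass the SunOS settings on the make command line instead of
# editing the top-level Makefile. build_sp_sunos is an illustrative name.
build_sp_sunos() {
    # Single quotes keep $(XLIBS) literal so that make, not the shell,
    # expands it. Extra arguments (such as "check") are passed through.
    ${MAKE:-make} 'LIBS=-liostream -lg++ $(XLIBS)' RANLIB=ranlib "$@"
}
```

Calling build_sp_sunos with no arguments builds the default target, and build_sp_sunos check runs the validation suite with the same settings.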