Notes about SP version 1.3

Last update: [15-May-1999]

Introduction

Writing HTML files without a validating parser is like trying to write computer programs without a compiler: don't do it! Fortunately, help is readily available on the Internet.

James Clark is developing a new implementation of a suite of SGML parser tools, called SP. These include:

nsgmls
-- an sgmls-compatible validating SGML parser.
spam
-- an SGML markup stream editor. This program is capable of supplying missing end tags, expanding abbreviated tags, and also of filling in all known attributes in every begin tag. The latter can be handy if you want to find what attributes are available, without having to read the HTML grammar.
sgmlnorm
-- a simple SGML tag normalizer.
spent
-- print SGML entity on the standard output
sx
-- translate SGML (and HTML) to XML
WARNING: The name sx conflicts with an older program distributed as part of the sx, sb, sz - XMODEM, YMODEM, ZMODEM file send package. If you have those older programs installed, you will have to rename one or the other of them. At my site, since most users now employ either PPP or SLIP for terminal connections, the need for those old modem programs has diminished, and in any event, the Columbia University kermit system provides a superior alternative to the [XYZ]MODEM package for file transfer. I therefore renamed the old sx to sx.old.

Besides being a complete redesign of the earlier successful sgmls implementation, the new programs are designed for the future: they support extended character sets, such as Unicode, and various multi-byte encodings used in oriental languages.

For a textbook treatment of the use of the SP system, see the useful book

@String{pub-PH                  = "Pren{\-}tice-Hall"}

@String{pub-PH:adr              = "Englewood Cliffs, NJ 07632, USA"}

@Book{McGrath:1997:SS,
  author =       "Sean McGrath",
  title =        "ParseMe.1st: {SGML} for Software Developers",
  publisher =    pub-PH,
  address =      pub-PH:adr,
  pages =        "xxiii + 341",
  month =        jan,
  year =         "1998",
  ISBN =         "0-13-488967-3",
  LCCN =         "QA76.76.H94M388 1998",
  bibdate =      "Sat Jan 4 12:20:44 MST 1997",
  price =        "US\$33.75",
  acknowledgement = ack-nhfb,
  keywords =     "SGML (document markup language)",
  xxprice =      "US\$55.00",
}

Some details

The new code is written almost entirely in C++ (almost 78K lines at version 1.3, or nearly four times the size of Don Knuth's TeX or Metafont), and requires template support, a relatively new feature of C++ which is not yet widely available. [An ANSI/ISO Standard for C++, ISO/IEC 14882:1998 Programming languages -- C++, was finally adopted in 1998, and by mid-1999, a few UNIX vendors claimed conformance to that Standard.]

WARNING: To build these programs, you will need about 60MB of disk space if you add the -g compiler debug option. Adding -g increases executable sizes to about 10MB each, from about 2.5MB (on a Sun SPARC Solaris 2.6 system). Without that option, about 20MB of disk space suffices.

The SP programs can be compiled and built using recent releases of GNU g++ and libg++ (2.7.1 or later), or better, the newer, and more-easily buildable, Cygnus egcs development releases at ftp://egcs.cygnus.com/pub/egcs/releases/. g++ itself is built as part of the GNU gcc compiler installation; although that installation takes a few hours, and requires about 120MB of disk space to be able to run the validation tests before installation, it is straightforward, and should be problem free on most current UNIX systems. The GNU compiler suite has also been built on IBM PC MS DOS and DEC OpenVMS systems, although those versions usually lag behind.

The SP distribution site has binaries for SP version 1.3 for IBM PC DOS, and Windows 95 and Windows NT.

Binaries for older versions are available for Intel 386 Linux, Sun Solaris 2.5, and DEC Alpha OSF/1 3.2.

Just as with sgmls, lengthy command lines are needed to run these programs successfully. To facilitate their use, I've prepared simple UNIX shell scripts html-ncheck and html-spam to hide the complexity, so that only the HTML files need to be provided on the script command lines.
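The idea behind such a wrapper can be sketched as follows. This is a minimal illustration, not the distributed html-ncheck script: the function name ncheck and the variables HTML_CATALOG and NSGMLS are invented here, and the catalog path is assumed to be the html-check location mentioned in these notes. It relies only on the nsgmls options -s (suppress normal output, so that only error messages appear) and -c (name the catalog file).

```shell
#!/bin/sh
# Hypothetical wrapper in the spirit of html-ncheck (NOT the distributed script).
# HTML_CATALOG and NSGMLS are names invented for this sketch.
HTML_CATALOG=${HTML_CATALOG:-/usr/local/lib/html-check/lib/catalog}
# Set NSGMLS="echo nsgmls" for a dry run that only prints the commands.
NSGMLS=${NSGMLS:-nsgmls}

ncheck() {
    status=0
    for f in "$@"
    do
        # -s suppresses the parser's normal output; errors are still reported.
        # -c names the catalog used to map public identifiers to files.
        $NSGMLS -s -c "$HTML_CATALOG" "$f" || status=1
    done
    return $status
}

# Validate every file named on the command line.
ncheck "$@"
```

Saved under a name of your choosing and made executable, the script would be run with just the HTML files as arguments; its exit status is nonzero if any file fails validation.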

If you have installed the html-check distribution, and you want to use html-spam, you need to add these lines to the end of the HTML catalog file, /usr/local/lib/html-check/lib/catalog:

        -- Added at the suggestion of James Clark --
        -- so that spam -p doesn't output the contents of html.decl --
SGMLDECL html.decl

They have already been added to the distributed catalog file, but you might already have such a file installed that needs the update.

Without this change, the contents of html.decl are copied to the output if the -p option is included in the spam invocation in html-spam; omitting -p and including html.decl doesn't help, because the <!DOCTYPE ... > line is then lost.
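For an already-installed catalog, the update can be made safe to repeat. The following is a small sketch under stated assumptions: the function name add_sgmldecl is invented here, and the check simply looks for a line beginning with SGMLDECL, so it would not notice an entry written in lower case or with leading white space.

```shell
#!/bin/sh
# Hypothetical helper: append the SGMLDECL entry to a catalog file
# unless an entry of that kind is already there.
add_sgmldecl() {
    catalog=$1
    if grep '^SGMLDECL ' "$catalog" >/dev/null 2>&1
    then
        echo "SGMLDECL entry already present in $catalog"
    else
        # Append a catalog comment and the entry itself.
        printf '%s\n' \
            '-- so that spam -p does not output the contents of html.decl --' \
            'SGMLDECL html.decl' >>"$catalog" &&
        echo "SGMLDECL entry added to $catalog"
    fi
}

# Example (not run here):
#   add_sgmldecl /usr/local/lib/html-check/lib/catalog
```

Running the helper a second time leaves the file unchanged, which is convenient if the catalog is updated from an installation script.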

Installation report

I have successfully built sp-1.3 with g++ (gcc 2.8.1 [2-Mar-1998] or gcc version egcs-2.91.66 (egcs-1.1.2 source release)) on these systems:

using the command

make && make check && make install

On a few of these, minor problems cropped up and were solved; they are discussed further below.

I also made unsuccessful attempts to build SP with native C++ compilers on Hewlett-Packard HP-UX 10.01 and Silicon Graphics IRIX 5.3, with a command line like

make CXX=CC CXXFLAGS=-O DEFINES='-DANSI_CLASS_INST $(XDEFINES)'

Numerous compiler errors quickly led to my abandoning the effort.

Compilation with native Sun Solaris 2.3 CC looked initially promising, but linking failed with errors about differing sizes of particular symbols, and with many missing functions arising from template instantiation. This linking problem is just what I found with SP 0.4 on the IBM RS/6000 AIX 3.2.5 systems too.

HP 9000/7xx HP-UX 10.01

I built sp on this system using make XDEFINES='-D_POSIX_SOURCE'; the extra definition is needed to expose the off_t type definition in <sys/types.h>.

HP 9000/7xx HP-UX 10.20

Mail from Michael Riedmann at Hewlett-Packard GmbH in Böblingen, Germany on 12 May 1998 reported a successful build of SP version 1.3 on HP-UX 10.20 with g++ version 2.7.2.3, after installing HP patch PHKL_8693 to fix a problem with a non-ANSI extern struct declaration in /usr/include/sys/time.h.

IBM RS/6000 AIX 4.2

The function set_new_handler() came up undefined when the code was compiled with egcs-2.91.66. Building with make LIBS='-L/usr/local/lib/gcc-lib/powerpc-ibm-aix4.2.1.0/2.8.1 -lgcc' linked against the gcc-2.8.1 version of the C++ library instead, which resolved the problem.

Further analysis uncovered the cause of this error: the file include/config.h contains the lines

#ifdef __GNUG__
...            
// set_new_handler() has to be declared extern "C"
#define SP_SET_NEW_HANDLER_EXTERN_C

While this is valid for gcc version 2.8.1 and earlier, the newer egcs defines the set_new_handler() function as a native C++ function, rather than a C function.

Ideally, sp should be built with a simple GNU autoconf-generated configure script that determines whether SP_SET_NEW_HANDLER_EXTERN_C should be defined.

I suspect that the code would have built correctly had I commented out the definition of that symbol in include/config.h, but I chose not to do so, with the goal of building sp everywhere with zero code modifications.
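A configure-time probe of that kind could look roughly like the sketch below. It is an illustration only, not part of the SP distribution: the function name probe_extern_c_new_handler is invented, and the test program is an assumption about what such a check would try, namely linking a call to a set_new_handler() declared extern "C". If the link succeeds, the C++ runtime library supplies a C-linkage version and the symbol should be defined; otherwise it should be left undefined.

```shell
#!/bin/sh
# Hypothetical probe: does the C++ runtime library provide set_new_handler()
# with C linkage?  Prints a config.h-style line with the answer.
probe_extern_c_new_handler() {
    cxx=${1:-g++}                    # compiler to test; g++ is an assumed default
    dir=$(mktemp -d) || return 2
    cat >"$dir/probe.cc" <<'EOF'
typedef void (*handler_t)();
extern "C" handler_t set_new_handler(handler_t);
static void out_of_memory() { }
int main() { set_new_handler(out_of_memory); return 0; }
EOF
    # The compile-and-link step fails if no C-linkage symbol exists.
    if $cxx -o "$dir/probe" "$dir/probe.cc" >/dev/null 2>&1
    then
        echo '#define SP_SET_NEW_HANDLER_EXTERN_C'
    else
        echo '/* #undef SP_SET_NEW_HANDLER_EXTERN_C */'
    fi
    rm -rf "$dir"
}

# Example (not run here):
#   probe_extern_c_new_handler g++
```

The printed line could then be merged into include/config.h by hand, which keeps the probe separate from the source tree.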

Intel Pentium-II MMX GNU/Linux 2.0.35

The function set_new_handler() came up undefined when the code was compiled with egcs-2.91.66. Building with make LIBS=-L/usr/lib linked against the system version of the C++ library instead, which resolved the problem.

For an explanation of this anomaly, see the IBM RS/6000 AIX 4.2 section above.

Because GNU/Linux systems are notorious for problems from incompatible versions of shared libraries, I prepared a separate sp-1.3-linux-i686-2.0.35-libs.tar.gz file containing the four libraries needed by the SP executables:

ld-linux.so.2
libc.so.6
libm.so.6
libstdc++.so.2.8

If they prove necessary on your system, you can install those you need in /usr/local/lib, or any other convenient place. If the chosen directory is not already listed in /etc/ld.so.conf, add it, then, as root, run the command ldconfig to update the run-time linker's cache of directories. If you cannot do this, you can instead add the directory to the LD_LIBRARY_PATH environment variable, which the run-time linker searches when the executables start.
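The check-then-add step for /etc/ld.so.conf can be sketched as a small shell function. The name ensure_ld_dir is invented here, and the second argument exists only so that the function can be tried on a scratch file before it is pointed at the real configuration file as root.

```shell
#!/bin/sh
# Hypothetical helper: make sure a library directory is listed in the
# run-time linker's configuration file, adding it only if it is missing.
ensure_ld_dir() {
    dir=$1
    conf=${2:-/etc/ld.so.conf}
    # -F: treat the directory as a fixed string; -x: match whole lines only.
    if grep -Fx "$dir" "$conf" >/dev/null 2>&1
    then
        echo "$dir already listed in $conf"
    else
        echo "$dir" >>"$conf" &&
        echo "$dir added to $conf; now run ldconfig as root"
    fi
}

# Example (not run here; needs root for the real file):
#   ensure_ld_dir /usr/local/lib && ldconfig
```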

NeXT Turbostation Mach 3.0

I modified the top-level SP Makefile to set RANLIB=ranlib. The build of SP then completed successfully, and make check passed all of the validation tests.