Closure XML Parser

An XML parser written in Common Lisp.

Closure XML was written by Gilbert Baumann (unk6 at rz.uni-karlsruhe.de) as part of the Closure web browser.
Contributions to the parser by

Send bug reports to david@lichteblau.com.

Download

Get a tarball.

David's tla archive is at http://www.common-lisp.net/project/cxml/david@knowledgetools.de--cxml/. (Brief tla usage instructions: Unpack the cxml tarball.  Enter tla register-archive URL to turn it into a working copy.  tla update is similar to cvs up.) Note that this used to be www.common-lisp.net and is now just common-lisp.net.

Contents

Recent Changes

patch-357 (2004-10-10)

patch-306 (2004-09-03)

patch-279 (2004-05-11)

patch-204

patch-191 (2004-03-18)

CXML Modules

CXML provides three packages:

Installation

CXML should be portable to all Common Lisp implementations supporting gray streams.  Currently assumed to work are:

Optional configuration (skip this unless you know better): CXML has full Unicode support, even on Lisps without Unicode strings. On Lisps that are not Unicode-aware, DOMString is implemented as an array of character codes. CXML auto-detects at compile time which string representation to use. To override the auto-detection, set one of the features :rune-is-character and :rune-is-octet before loading cxml.asd. (fixme: feature :rune-is-octet is of course misnamed, since it uses 16-bit runes, not 8-bit runes. It will probably be renamed to :rune-is-integer at some point.)
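As a minimal sketch of the override described above: the feature has to be present before cxml.asd is loaded. The choice of :rune-is-character here is only an illustration, not a recommendation.

```lisp
;; Evaluate this before loading cxml.asd.
;; :rune-is-character is picked purely as an example; use
;; :rune-is-octet instead to force integer runes.
(pushnew :rune-is-character *features*)
```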

ASDF is used for compilation. The following instructions assume that ASDF has already been loaded.

Prerequisites. CXML needs the puri library.

Compiling and loading CXML. Register the .asd file, e.g. by symlinking it:

$ ln -sf `pwd`/cxml.asd /path/to/your/registry/

Then compile CXML using:

* (asdf:operate 'asdf:load-op :cxml)
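If symlinking is inconvenient, ASDF can also be pointed at the source directory from within Lisp. The sketch below assumes a hypothetical location /path/to/cxml/ for the directory containing cxml.asd, and that puri is already findable by ASDF.

```lisp
;; Hypothetical path: replace with the directory that contains cxml.asd.
;; The trailing slash matters, because ASDF expects a directory pathname.
(push #p"/path/to/cxml/" asdf:*central-registry*)

;; Same load form as above; it now finds cxml.asd via the registry entry.
(asdf:operate 'asdf:load-op :cxml)
```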

You can then try the quick-start example.

Tests

Check out the XML and DOM testsuites:

$ export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
$ cvs login    # password is "anonymous"
$ cvs co 2001/XML-Test-Suite/xmlconf
$ cvs co 2001/DOM-Test-Suite

Usage and expected output:

* (xmlconf:run-all-tests "/path/to/2001/XML-Test-Suite/xmlconf/")
0/556 tests failed; 1606 tests were skipped
* (domtest:run-all-tests "/path/to/2001/DOM-Test-Suite/")
0/450 tests failed; 71 tests were skipped

fixme: Add an explanation of xml/sax-tests here.

fixme: My parser does not understand the current testsuite anymore.  To work around this problem, revert the affected files manually after check-out:

$ cd 2001/XML-Test-Suite/xmlconf/
xmlconf$ patch -p0 -R </path/to/cxml/test/xmlconf-base.diff

The log message for the changes reads "Removed unnecessary xml:base attribute".  If I understand correctly, only DOM 3 parsers provide the baseURI attribute necessary for understanding xmlconf.xml now.  We don't have that yet.

To Do

(Compare also with Gilbert Baumann's older TODO list in xml-parse.lisp.)

Using CXML

Quick-Start Example

Make sure to install and load cxml first.

Create a test file called example.xml:

* (with-open-file (s "example.xml" :direction :output)
    (write-string "<test a='b'><child/></test>" s))

Parse example.xml into a DOM tree (read more):

* (cxml:parse-file "example.xml" (dom:make-dom-builder))
#<DOM-IMPL::DOCUMENT @ #x72206172>
;; save result for later:
* (defparameter *example* *)
*EXAMPLE*

Inspect the DOM tree (read more):

* (dom:document-element *example*)
#<DOM-IMPL::ELEMENT test @ #x722b6ba2>
* (dom:tag-name (dom:document-element *example*))
"test"
* (dom:child-nodes (dom:document-element *example*))
#(#<DOM-IMPL::ELEMENT child @ #x722b6d8a>)
* (dom:get-attribute (dom:document-element *example*) "a")
"b"

Serialize the DOM document back into a stream (read more):

* (cxml:unparse-document *example* *standard-output*)
<test a="b"><child></child></test>

As an alternative to DOM, parse into xmls-compatible list structure (read more):

* (cxml:parse-file "example.xml" (cxml-xmls:make-xmls-builder))
("test" (("a" "b")) ("child" NIL))

Parsing and Validating

Function CXML:PARSE-FILE (pathname handler &key ...)
Function CXML:PARSE-STREAM (stream handler &key ...)
Function CXML:PARSE-OCTETS (octets handler &key ...)
Parse an XML document.  Return values from this function depend on the SAX handler used.
Arguments:

Common keyword arguments:

Function CXML:PARSE-DTD-FILE (pathname)
Function CXML:PARSE-DTD-STREAM (stream)
Parse declarations from a stand-alone file and return an object representing the DTD, suitable as an argument to validate.

Function CXML:MAKE-EXTID (publicid systemid)
Create an object representing the External ID composed of the specified Public ID (a rod or nil) and System ID (a URI object).

Function DOM:MAKE-DOM-BUILDER ()
Create a SAX handler which builds a DOM document.  Example:

(cxml:parse-file "test.xml" (dom:make-dom-builder))

Serialization

Function CXML:UNPARSE-DOCUMENT (document stream &rest keys)
Function CXML:UNPARSE-DOCUMENT-TO-OCTETS (document &rest keys) => vector
Serialize a DOM document object.

Keyword arguments:

The following canonical values are allowed:

With an indentation level, pretty-print the XML by inserting additional whitespace.  Note that indentation changes the document model and should only be used if whitespace does not matter to the application.

unparse-document-to-octets returns an (unsigned-byte 8) array, whereas unparse-document writes characters.  unparse-document is useful together with with-output-to-string.  However, note that the resulting document in both cases is UTF-8 encoded, so the characters written by unparse-document are really UTF-8 bytes encoded as characters.
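A minimal sketch of the with-output-to-string combination mentioned above. The helper name document-to-string is hypothetical and not part of CXML.

```lisp
;; document-to-string is a hypothetical helper, not a CXML function.
(defun document-to-string (document)
  "Serialize DOCUMENT, a DOM document, and return the result as a string.
As noted above, the characters in the result are really UTF-8 bytes."
  (with-output-to-string (out)
    (cxml:unparse-document document out)))

;; Usage, with the *example* document from the quick-start section:
;; (document-to-string *example*)
```" s))
(let ((result (document-to-string
               (cxml:parse-file "example.xml" (dom:make-dom-builder)))))
  (assert (stringp result))
  (assert (search "child" result)))
</test>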

Function CXML:MAKE-CHARACTER-STREAM-SINK (stream &rest keys) => sink
Function CXML:MAKE-OCTET-VECTOR-SINK (&rest keys) => sink
Return a handle suitable for event-based XML serialization.

These functions provide the low-level mechanism used by the DOM serialization functions. To serialize a document without building its DOM tree first, create a sink handle and call SAX functions on that handle. sax:end-document returns the serialized form of the document described by the SAX events.
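The following sketch drives a sink directly with SAX events, using only the functions listed in this document. It assumes a Lisp in which rods are ordinary strings (:rune-is-character); elsewhere the string arguments would have to be rods. The function name serialize-greeting and the element name are made up for illustration, and passing nil for the namespace arguments is an assumption.

```lisp
;; serialize-greeting is a hypothetical name used for illustration.
(defun serialize-greeting ()
  "Emit a one-element document through an octet vector sink."
  (let ((sink (cxml:make-octet-vector-sink)))
    (sax:start-document sink)
    ;; Arguments: handler namespace-uri local-name qname attributes.
    ;; No namespace and no attributes are used here (an assumption).
    (sax:start-element sink nil nil "greeting" '())
    (sax:characters sink "Hello")
    (sax:end-element sink nil nil "greeting")
    ;; For a sink, end-document returns the serialized document.
    (sax:end-document sink)))
```

Because no DOM tree is built, this style suits output that is generated on the fly.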

Macro CXML:WITH-XML-OUTPUT (sink &body body) => vector
Macro CXML:WITH-ELEMENT (qname &body body) => result
Function CXML:ATTRIBUTE (name value) => value
Function CXML:TEXT (data) => data
Function CXML:CDATA (data) => data
Convenience syntax for event-based serialization.

Example:

(with-xml-output (make-octet-stream-sink stream :indentation 2 :canonical nil)
  (with-element "foo"
    (attribute "xyz" "abc")
    (with-element "bar"
      (attribute "blub" "bla"))
    (text "Hi there.")))

Prints this to stream, which must be an (unsigned-byte 8) stream:

<foo xyz="abc">
  <bar blub="bla"></bar>
  Hi there.
</foo>

(Note that these functions accept both strings and rods, so the example above could equally use rod literals such as #"foo" in place of "foo".)

Macro XHTML-GENERATOR:WITH-XHTML (sink &rest forms)
Macro XHTML-GENERATOR:WRITE-DOCTYPE (sink)
Macro with-xhtml is a modified version of Franz' html macro which works as a SAX driver for XHTML. It aims to be a plug-in replacement for the html macro.

xhtmlgen is included as contrib/xhtmlgen.lisp in the cxml distribution. Example:

(let ((sink (cxml:make-character-stream-sink *standard-output*)))
  (sax:start-document sink)
  (xhtml-generator:write-doctype sink)
  (xhtml-generator:with-html sink
    (:html
     (:head
      (:title "Titel"))
     (:body
      ((:p "style" "font-weight: bold")
       "Inhalt")
      (:ul
       (:li "Eins")
       (:li "Zwei")
       (:li "Drei")))))
  (sax:end-document sink))

Miscellaneous Utility Functions

Function CXML:MAKE-VALIDATOR (dtd root)
Create a SAX handler which validates against a DTD instance.  The document's root element must be named root.  Used with dom:map-document, this validates a document object as if by re-reading it with a validating parser, except that declarations recorded in the document instance are completely ignored.
Example:

(let ((d (parse-file "~/test.xml" (dom:make-dom-builder)))
      (x (parse-dtd-file "~/test.dtd")))
  (dom:map-document (cxml:make-validator x #"foo") d))

Function DOM:MAP-DOCUMENT (handler document &key include-xmlns-attributes include-default-values)
Traverse a DOM document and call SAX functions as if an XML representation of the document were processed by a SAX parser.
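Since a serialization sink is itself a SAX handler, dom:map-document can feed a DOM document into one. This is a sketch only; the function name print-document-via-sax is hypothetical.

```lisp
;; print-document-via-sax is a hypothetical helper, not part of CXML.
(defun print-document-via-sax (document &optional (stream *standard-output*))
  "Replay DOCUMENT as SAX events into a character stream sink on STREAM."
  (dom:map-document (cxml:make-character-stream-sink stream) document))

;; Usage, with the *example* document from the quick-start section:
;; (print-document-via-sax *example*)
```" s))
(let* ((document (cxml:parse-file "example.xml" (dom:make-dom-builder)))
       (output (with-output-to-string (out)
                 (print-document-via-sax document out))))
  (assert (search "child" output)))
</test>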

XMLS Compatibility

Like other XML parsers written in Lisp, CXML can work with documents represented as list structures. The specific model implemented by cxml is compatible with the xmls parser. Xmls list structures are a simpler and faster alternative to full DOM document trees. They also serve as an example showing how to implement user-defined document models as an independent layer over the base parser (cf. xml/xmls-compat.lisp in the cxml distribution). However, note that the list structures do not include all information available in DOM documents and are sometimes more difficult to work with, since many DOM functions cannot be implemented on them.

Function CXML-XMLS:MAKE-XMLS-BUILDER ()
Create a SAX handler which builds XMLS list structures.  Example:

(cxml:parse-file "test.xml" (cxml-xmls:make-xmls-builder))

Function CXML-XMLS:MAP-NODE (handler node &key include-xmlns-attributes)
Traverse an XMLS document/node and call SAX functions as if an XML representation of the document were processed by a SAX parser.

Use this function to serialize XMLS data. For example, we could define a replacement for xmls:write-xml like this:

(defun write-xml (stream node &key indent)
  (let ((sink (cxml:make-character-stream-sink
               stream :canonical nil :indentation indent)))
    (cxml-xmls:map-node sink node)))

Function CXML-XMLS:MAKE-NODE (&key name ns attrs children) => xmls node
Build a list node of the form (name ((name value)*) child*).

The node list's car can also be a cons of local name and namespace prefix ns. fixme: It is unclear to me how namespaces are meant to work in xmls, since xmls documentation differs from how xmls actually works in current releases. Usually applications need to know both the namespace prefix and the namespace URI. We currently follow the xmls implementation and use the namespace prefix instead of following its documentation which shows the URI. We do not follow xmls in munging xmlns attribute values. Attributes themselves have namespaces and it is not clear to me how that works in xmls.

Accessor CXML-XMLS:NODE-NAME (node)
Accessor CXML-XMLS:NODE-NS (node)
Accessor CXML-XMLS:NODE-ATTRS (node)
Accessor CXML-XMLS:NODE-CHILDREN (node)
Accessors for xmls node data.
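A short sketch of building and inspecting a node with the functions above. The variable name *node* is hypothetical, and the strings assume a Lisp in which rods are ordinary strings. The node built here has the same shape as the xmls result shown in the quick-start example.

```lisp
;; *node* is a hypothetical variable name used for illustration.
(defparameter *node*
  (cxml-xmls:make-node
   :name "test"
   :attrs '(("a" "b"))
   :children (list (cxml-xmls:make-node :name "child"))))

;; Take the node apart again with the accessors.
(list (cxml-xmls:node-name *node*)
      (cxml-xmls:node-attrs *node*)
      (cxml-xmls:node-children *node*))
```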

Dealing with Rods

As explained above, the XML parser handles character encoding and uses 16bit strings internally. Instead of using characters and strings it uses runes and rods. This is seen as a feature, but can be inconvenient.

Note that the recoder approach does not work with the DOM builder, since DOM is specified to use UTF-16.

Function CXML:MAKE-RECODER (chained-handler &optional recoder-fn)
Return a SAX handler which passes all events on to chained-handler after converting all strings and rods using recoder-fn, a function of one argument which defaults to runes:rod-string.

Example. In a Lisp which ordinarily would use octet vector rods:

CL-USER(14): (cxml:parse-string "<test/>" (cxml-xmls:make-xmls-builder))
(#(116 101 115 116) NIL)

Use a SAX recoder to get strings instead:

CL-USER(17): (cxml:parse-string "<test/>" (cxml:make-recoder (cxml-xmls:make-xmls-builder)))
("test" NIL)

Caching of DTD Objects

To avoid spending time parsing the same DTD over and over again, CXML can cache DTD objects. The parser consults cxml:*dtd-cache* whenever it is looking for an external subset in a document which does not have an internal subset and uses the cached DTD instance if one is present in the cache for the System ID in question.

Note that DTDs do not expire from the cache automatically. (Future versions of CXML might introduce automatic checks for outdated DTDs.)

Variable CXML:*DTD-CACHE*
The DTD cache object consulted by the parser when it needs a DTD.

Function CXML:MAKE-DTD-CACHE ()
Return a new, empty DTD cache object.

Variable CXML:*CACHE-ALL-DTDS*
If true, instructs the parser to enter all DTDs that could have been cached into *dtd-cache* if they were not cached already. Defaults to nil.

Reader CXML:GETDTD (uri dtd-cache)
Return a cached instance of the DTD at uri, if present in the cache, or nil.

Writer CXML:GETDTD (uri dtd-cache)
Enter a new value for uri into dtd-cache.

Function CXML:REMDTD (uri dtd-cache)
Ensure that no DTD is recorded for uri in the cache and return true if such a DTD was present.

Function CXML:CLEAR-DTD-CACHE (dtd-cache)
Remove all entries from dtd-cache.
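One way to use the cache, sketched under the assumption that several documents sharing a DTD are parsed in one go: bind a fresh cache and enable *cache-all-dtds* around the parses. The function name parse-files-with-shared-cache is hypothetical.

```lisp
;; parse-files-with-shared-cache is a hypothetical helper, not part of CXML.
(defun parse-files-with-shared-cache (pathnames)
  "Parse each file in PATHNAMES into a DOM document, sharing one DTD cache."
  (let ((cxml:*dtd-cache* (cxml:make-dtd-cache)) ; fresh, empty cache
        (cxml:*cache-all-dtds* t))               ; enter every cacheable DTD
    (mapcar (lambda (pathname)
              (cxml:parse-file pathname (dom:make-dom-builder)))
            pathnames)))
```

Binding the variables with let keeps the cache local to this call, so stale DTDs do not linger in a global cache.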

fixme: thread-safety

XML Catalogs

External entities (for example, DTDs) are referred to using their Public and System IDs. Usually the System ID, a URI, is used to locate the entity. CXML itself handles only file://-URIs, but many System IDs in practical use are http://-URIs. There are two different mechanisms applications can use to allow CXML to locate entities using an arbitrary Public ID or System ID:

This section describes XML Catalogs, the second solution. CXML implements OASIS XML Catalogs.

Variable CXML:*CATALOG*
The XML Catalog object consulted by the parser before trying to open an entity. Initially nil.

Variable CXML:*PREFER*
The default "prefer" mode from the Catalog specification, one of :public or :system. Defaults to :public.

Function CXML:MAKE-CATALOG (&optional uris)
Return a catalog object for the catalog files specified.

Function CXML:RESOLVE-URI (uri catalog)
Look up uri in catalog and return the resulting URI, or nil if no match was found.

Function CXML:RESOLVE-EXTID (publicid systemid catalog)
Look up the External ID (publicid, systemid) in catalog and return the resulting URI, or nil if no match was found.

Example:

* (setf cxml:*catalog* nil)
* (cxml:parse-file "test.xhtml" nil)
=> Error: URI scheme :HTTP not supported

* (setf cxml:*catalog* (cxml:make-catalog))
* (cxml:parse-file "test.xhtml" nil)
;; no error!
NIL

Note that parsed catalog files are cached in the catalog object. Catalog files cached do not expire automatically. To ensure that all catalog files are parsed again, create a new catalog object.

SAX Interface

A SAX handler is an arbitrary object that implements some of the generic functions in the SAX package.  Note that no default handler class is necessary, because all generic functions have default methods which do nothing.  SAX functions are:

Function SAX:START-DOCUMENT (handler)
Function SAX:END-DOCUMENT (handler)

Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:CHARACTERS (handler data)

Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)
Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)
Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)
Function SAX:NOTATION-DECLARATION (handler name public-id system-id)
Function SAX:ELEMENT-DECLARATION (handler name model)
Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)

Accessor SAX:ATTRIBUTE-PREFIX (attribute)
Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)
Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)
Accessor SAX:ATTRIBUTE-VALUE (attribute)
Accessor SAX:ATTRIBUTE-QNAME (attribute)
Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)

The entity declaration methods are similar to Java SAX definitions, but parameter entities are distinguished from general entities not by a % prefix to the name, but by the kind argument, either :parameter or :general.

The arguments to sax:element-declaration and sax:attribute-declaration differ significantly from their Java counterparts.
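As an illustration of a user-defined handler, the sketch below counts start tags. The class element-counter and the function count-elements are hypothetical names. Only sax:start-element is specialized; all other events fall through to the default methods, which do nothing.

```lisp
;; element-counter and count-elements are hypothetical names.
(defclass element-counter ()
  ((count :initform 0 :accessor element-count))
  (:documentation "SAX handler that counts the elements it sees."))

(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-count handler)))

(defun count-elements (pathname)
  "Parse the file at PATHNAME and return the number of elements in it."
  (let ((handler (make-instance 'element-counter)))
    (cxml:parse-file pathname handler)
    (element-count handler)))

;; Usage, with the file from the quick-start section:
;; (count-elements "example.xml")
```" s))
(assert (= 2 (count-elements "example.xml")))
</test>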

fixme: For more information on these functions refer to the docstrings.

DOM Notes

CXML implements the DOM Level 1 Core interfaces.  Explaining DOM is better left to the specification, so please refer to the official W3C documents for DOM.

However, there is no "standard" DOM mapping for Lisp.  DOM is specified in CORBA IDL, but it refrains from using object-oriented IDL features, allowing for a much more natural Lisp implementation than the ordinary IDL/Lisp mapping would.

Differences between CXML's DOM and the direct IDL/Lisp mapping:

Example:

XML(97): (dom:node-type
          (dom:document-element
           (cxml:parse-file "~/test.xml" (dom:make-dom-builder))))
:ELEMENT