Closure XML Parser
An XML parser written in Common Lisp.
Closure XML was written by Gilbert Baumann
(unk6 at rz.uni-karlsruhe.de) as part of the Closure web
browser.
Contributions to the parser by
- Henrik Motakef (hmot at henrik-motakef.de)
  (SAX layer; namespace support)
- David Lichteblau for knowledgeTools
  (conversion into an independent package; DOM bug fixing; validation)
  and headcraft
  (most of the October 2004 release).
Send bug reports to david@lichteblau.com.
Download
Get a tarball.
David's tla archive is at http://www.common-lisp.net/project/cxml/david@knowledgetools.de--cxml/.
(Brief tla usage instructions: Unpack the cxml tarball.
Enter tla register-archive URL to turn it into a working
copy. tla update is similar to cvs up.)
Note that this used to
be www.common-lisp.net and is
now just common-lisp.net.
Recent Changes
patch-357 (2004-10-10)
- Auto-detect unicode support for better asdf-installability.
- Use the puri library for Sys-ID handling.
- Semi-automatic caching of DTD instances.
- Support user-defined entity resolvers.
- Support for Oasis XML Catalogs.
- xhtmlgen version of Franz htmlgen.
- Fixes for SBCL's unicode support.
patch-306 (2004-09-03)
- Event-based serialization which does not require DOM documents
- XMLS compatibility
- minor bugfixes (thread safety; should work on clisp again)
patch-279 (2004-05-11)
- Validation
- bugfixes; XHTML DTD parses again; corrected SAX entity handling
patch-204
- Renamed package XML to CXML.
- The unparse functions support non-canonical output now.
patch-191 (2004-03-18)
CXML Modules
CXML provides three packages:
- RUNES, a portable implementation of Unicode strings.
- CXML, a namespace-aware validating SAX parser
  implementing the XML 1.0 specification.
- DOM, an implementation of the DOM Level 1 Core interfaces.
Installation
CXML should be portable to all Common Lisp implementations
supporting gray streams. Currently assumed to work are:
- ACL (with support for rune-is-character in the
  unicode-enabled images)
- SBCL. The rune-is-character mode needs SBCL's Unicode
  branch ("character_branch"). (Note that cxml still uses
  surrogate characters instead of utilizing full 21-bit characters.
  This will probably be addressed in a future release.)
- CMUCL (no support for rune-is-character)
- CLISP (reported to work with and without rune-is-character).
  CLISP needs to be run with an option like -E iso-8859-1,
  teaching it to accept cxml's non-ASCII source files.
Optional configuration (skip this unless you know better): CXML
has full Unicode support -- even on Lisps without Unicode
strings. On non-unicode aware Lisps, DOMString is
implemented as an array of character codes. CXML will auto-detect
at compile-time which string representation to use. To override
the auto-detection, you can set one of the features
:rune-is-character and :rune-is-octet before
loading cxml.asd. (fixme: feature
:rune-is-octet is of course misnamed, since it uses 16bit
runes, not 8bit runes. It will probably be renamed
to :rune-is-integer at some point.)
ASDF is used for
compilation. The following instructions assume that ASDF has
already been loaded.
Prerequisites.
CXML needs the puri library.
Compiling and loading CXML.
Register the .asd file, e.g. by symlinking it:
$ ln -sf `pwd`/cxml.asd /path/to/your/registry/
Then compile CXML using:
* (asdf:operate 'asdf:load-op :cxml)
You can then try the quick-start example.
Tests
Check out the XML and DOM testsuites:
$ export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
$ cvs login # password is "anonymous"
$ cvs co 2001/XML-Test-Suite/xmlconf
$ cvs co 2001/DOM-Test-Suite
Usage and expected output:
* (xmlconf:run-all-tests "/path/to/2001/XML-Test-Suite/xmlconf/")
0/556 tests failed; 1606 tests were skipped
* (domtest:run-all-tests "/path/to/2001/DOM-Test-Suite/")
0/450 tests failed; 71 tests were skipped
fixme: Add an explanation of xml/sax-tests here.
fixme: My parser does not understand the current testsuite
anymore. To fix this problem, revert the affected files
manually after check-out:
$ cd 2001/XML-Test-Suite/xmlconf/
xmltest$ patch -p0 -R </path/to/cxml/test/xmlconf-base.diff
The log message for the changes reads "Removed unnecessary
xml:base attribute". If I understand correctly, only
DOM 3 parsers provide the baseURI attribute necessary for
understanding xmlconf.xml now. We don't have that
yet.
To Do
-
David's changes might have affected performance. Some
benchmarking needs to be done here. (The actual parser
seems to be faster than xmls -- good enough for me.)
-
DOM in general is pretty heavyweight. There is/was a
"simple-dom" which should be faster. This might be worth
reviving.
-
For those who don't like DOM at all, it would be a very simple
exercise to write a SAX handler for "Lisp-XML" output instead of
DOM. (done)
-
The serializer supports only Canonical
XML right now. In the future we want support for:
Including doctype declarations in the output
(done), ordinary output with less character reference
noise (done), optional indentation
(done), user-specified encoding, etc.
-
There are still thread-safety issues. (fixed)
-
Validation! (done)
-
Upgrade to DOM Level 2 for complete namespace support.
-
Unless rune-is-character is enabled, rod hashing
currently uses equalp hash tables, which can be very slow.
(See %make-rod-hash-table.)
(Compare also with Gilbert Baumann's older TODO list in
xml-parse.lisp.)
Using CXML
Quick-Start Example
Make sure to install and load cxml first.
Create a test file called example.xml:
* (with-open-file (s "example.xml" :direction :output)
    (write-string "<test a='b'><child/></test>" s))
Parse example.xml into a DOM tree (read
more):
* (cxml:parse-file "example.xml" (dom:make-dom-builder))
#<DOM-IMPL::DOCUMENT @ #x72206172>
;; save result for later:
* (defparameter *example* *)
*EXAMPLE*
Inspect the DOM tree (read more):
* (dom:document-element *example*)
#<DOM-IMPL::ELEMENT test @ #x722b6ba2>
* (dom:tag-name (dom:document-element *example*))
"test"
* (dom:child-nodes (dom:document-element *example*))
#(#<DOM-IMPL::ELEMENT child @ #x722b6d8a>)
* (dom:get-attribute (dom:document-element *example*) "a")
"b"
Serialize the DOM document back into a stream (read more):
* (cxml:unparse-document *example* *standard-output*)
<test a="b"><child></child></test>
As an alternative to DOM, parse into xmls-compatible list
structure (read more):
* (cxml:parse-file "example.xml" (cxml-xmls:make-xmls-builder))
("test" (("a" "b")) ("child" NIL))
Parsing and Validating
Function CXML:PARSE-FILE (pathname handler &key ...)
Function CXML:PARSE-STREAM (stream handler &key ...)
Function CXML:PARSE-OCTETS (octets handler &key ...)
Parse an XML document.
Return values from this function depend on the SAX handler used.
Arguments:
- pathname -- a Common Lisp pathname
- stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
- octets -- an (unsigned-byte 8) array
- handler -- a SAX handler
Common keyword arguments:
- validate -- A boolean. Defaults to
  nil. If true, parse in validating mode, i.e. assert that
  the document contains a DOCTYPE declaration and conforms to the
  DTD declared.
- dtd -- unless nil, an extid instance
  specifying the external subset to load. This option overrides
  the extid specified in the document type declaration, if any.
  See below for make-extid. This option is useful
  for verification purposes together with the root
  and disallow-internal-subset arguments.
- root -- the expected root element
  name, or nil (the default).
- entity-resolver -- nil or a function of two
  arguments which is invoked for every entity referenced by the
  document with the entity's Public ID (a rod) and System ID (a
  URI object) as arguments. The function may return
  nil, in which case CXML will try to resolve the entity as usual.
  Alternatively it may return a Common Lisp stream specialized on
  (unsigned-byte 8) which will be used instead. (It may
  also signal an error, of course, which can be useful to prohibit
  parsed XML documents from including arbitrary files readable by
  the parser.)
- disallow-internal-subset -- a boolean. If true, signal
  an error if the document contains an internal subset.
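As a minimal sketch of the entity-resolver argument described above, the following resolver refuses every external entity by signalling an error. The names refuse-external-entities and parse-without-external-entities are made up for this illustration and are not part of CXML; the sketch assumes cxml has already been loaded.

```lisp
;; Hypothetical resolver (not part of CXML): called with the entity's
;; Public ID and System ID, it never returns a stream or nil.
(defun refuse-external-entities (public-id system-id)
  "Signal an error instead of letting the parser open an external entity."
  (error "Refusing to load external entity (public ~S, system ~S)"
         public-id system-id))

;; Hypothetical wrapper: parse PATHNAME into a DOM document, passing the
;; resolver through the documented ENTITY-RESOLVER keyword argument.
(defun parse-without-external-entities (pathname)
  (cxml:parse-file pathname (dom:make-dom-builder)
                   :entity-resolver #'refuse-external-entities))
```

A resolver that wants the default behaviour for some entities can return nil for those instead of signalling an error.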
Function CXML:PARSE-DTD-FILE (pathname)
Function CXML:PARSE-DTD-STREAM (stream)
Parse declarations
from a stand-alone file and return an object representing the DTD,
suitable as an argument to validate.
- pathname -- a Common Lisp pathname
- stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
Function CXML:MAKE-EXTID (publicid systemid)
Create an object representing the External ID composed
of the specified Public ID, a rod or nil, and System ID
(a URI object).
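The keyword arguments above can be combined to check a document against a DTD chosen by the application rather than by the document. The following is only a sketch: the function name parse-and-validate is hypothetical, puri:parse-uri is assumed to be the way to build the URI object (CXML uses the puri library for System IDs), and passing root as an ordinary string assumes a rune-is-character build.

```lisp
;; Hypothetical helper (not part of CXML): validate PATHNAME against the
;; DTD located at DTD-URI, insisting on ROOT as the document element.
(defun parse-and-validate (pathname dtd-uri root)
  (cxml:parse-file pathname (dom:make-dom-builder)
                   :validate t
                   ;; Override whatever DOCTYPE the document declares.
                   :dtd (cxml:make-extid nil (puri:parse-uri dtd-uri))
                   :root root
                   ;; Reject documents that bring their own declarations.
                   :disallow-internal-subset t))
```

A call might look like (parse-and-validate "test.xml" "file:///path/to/test.dtd" "foo"), where both file names are placeholders.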
Function DOM:MAKE-DOM-BUILDER ()
Create a SAX handler which builds a DOM document. Example:
(cxml:parse-file "test.xml" (dom:make-dom-builder))
Serialization
Function CXML:UNPARSE-DOCUMENT (document stream &rest keys)
Function CXML:UNPARSE-DOCUMENT-TO-OCTETS (document &rest keys) => vector
Serialize a DOM document object.
- document -- a DOM document object
- stream -- a Common Lisp stream with element-type
character
Keyword arguments:
- canonical -- canonical form, one of NIL, T, 1, 2
- indentation -- indentation level. An integer or nil.
The following canonical values are allowed:
- T or 1: Canonical XML
- 2: Second Canonical Form
- NIL: Use a more readable non-canonical representation.
With an indentation level, pretty-print the XML by
inserting additional whitespace. Note that indentation
changes the document model and should only be used if whitespace
does not matter to the application.
unparse-document-to-octets returns an (unsigned-byte
8) array, whereas unparse-document writes
characters. unparse-document is useful together
with with-output-to-string. However, note that the
resulting document in both cases is UTF-8 encoded, so the
characters written by unparse-document are really UTF-8
bytes encoded as characters.
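A short sketch of the with-output-to-string combination mentioned above, using only the documented canonical and indentation keywords. The function name document-to-string is hypothetical, and cxml is assumed to be loaded.

```lisp
;; Hypothetical helper (not part of CXML): serialize DOCUMENT, a DOM
;; document object, into a Lisp string in non-canonical, indented form.
(defun document-to-string (document &key (indentation 2))
  (with-output-to-string (s)
    (cxml:unparse-document document s
                           :canonical nil
                           :indentation indentation)))
```

For example, (document-to-string (cxml:parse-file "example.xml" (dom:make-dom-builder))). As noted above, the characters in the resulting string are really UTF-8 bytes, so non-ASCII content will not appear as the corresponding Lisp characters.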
Function CXML:MAKE-CHARACTER-STREAM-SINK (stream &rest keys) => sink
Function CXML:MAKE-OCTET-VECTOR-SINK (&rest keys) => sink
Return a handle suitable for event-based XML serialization.
These functions provide the low-level mechanism used by the DOM
serialization functions. To serialize a document without building
its DOM tree first, create a sink handle and call SAX functions on that
handle. sax:end-document returns the serialized form of
the document described by the SAX events.
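The following sketch shows what calling SAX functions on a sink directly might look like, without any DOM tree. It assumes a rune-is-character build (so ordinary strings can be passed where rods are expected) and assumes that the sink only needs the qname, so nil is passed for the namespace URI and local name. The function name serialize-greeting is hypothetical.

```lisp
;; Hypothetical example (not part of CXML): emit <greeting>hello</greeting>
;; through SAX events and return whatever SAX:END-DOCUMENT returns for
;; this sink, which per the description above is the serialized form.
(defun serialize-greeting ()
  (let ((sink (cxml:make-octet-vector-sink :canonical nil)))
    (sax:start-document sink)
    (sax:start-element sink nil nil "greeting" '())
    (sax:characters sink "hello")
    (sax:end-element sink nil nil "greeting")
    (sax:end-document sink)))
```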
Macro CXML:WITH-XML-OUTPUT (sink &body body) => vector
Macro CXML:WITH-ELEMENT (qname &body body) => result
Function CXML:ATTRIBUTE (name value) => value
Function CXML:TEXT (data) => data
Function CXML:CDATA (data) => data
Convenience syntax for event-based serialization.
Example:
(with-xml-output (make-octet-stream-sink stream :indentation 2 :canonical nil)
  (with-element "foo"
    (attribute "xyz" "abc")
    (with-element "bar"
      (attribute "blub" "bla"))
    (text "Hi there.")))
Prints this to stream, which must be an
(unsigned-byte 8) stream:
<foo xyz="abc">
  <bar blub="bla"></bar>
  Hi there.
</foo>
(Note that these functions accept both strings and rods, which is
why the example can use "foo" rather than #"foo".)
Macro XHTML-GENERATOR:WITH-XHTML (sink &rest forms)
Macro XHTML-GENERATOR:WRITE-DOCTYPE (sink)
Macro with-xhtml is a modified version of
Franz's html macro which works as a SAX driver for XHTML.
It aims to be a plug-in replacement for the html macro.
xhtmlgen is included as contrib/xhtmlgen.lisp in
the cxml distribution. Example:
(let ((sink (cxml:make-character-stream-sink *standard-output*)))
  (sax:start-document sink)
  (xhtml-generator:write-doctype sink)
  (xhtml-generator:with-html sink
    (:html
     (:head
      (:title "Title"))
     (:body
      ((:p "style" "font-weight: bold")
       "Content")
      (:ul
       (:li "One")
       (:li "Two")
       (:li "Three")))))
  (sax:end-document sink))
Miscellaneous Utility Functions
Function CXML:MAKE-VALIDATOR (dtd root)
Create a SAX handler which validates against a DTD instance.
The document's root element must be named root.
Used with dom:map-document, this validates a document
object as if by re-reading it with a validating parser, except
that declarations recorded in the document instance are completely
ignored.
Example:
(let ((d (parse-file "~/test.xml" (dom:make-dom-builder)))
      (x (parse-dtd-file "~/test.dtd")))
  (dom:map-document (cxml:make-validator x #"foo") d))
Function DOM:MAP-DOCUMENT (handler document &key include-xmlns-attributes include-default-values)
Traverse a DOM document and call SAX functions as if an XML
representation of the document were processed by a SAX parser.
XMLS Compatibility
Like other XML parsers written in Lisp, CXML can work with
documents represented as list structures. The specific model
implemented by cxml is compatible with the xmls parser. Xmls
list structures are a simpler and faster alternative to full DOM
document trees. They also serve as an example showing how to
implement user-defined document models as an independent layer
over the base parser (cf. xml/xmls-compat.lisp in
the cxml distribution). However, note that the list structures do
not include all information available in DOM documents and are
sometimes more difficult to work with since many DOM functions
cannot be implemented on them.
Function CXML-XMLS:MAKE-XMLS-BUILDER ()
Create a SAX handler which builds XMLS list structures.
Example:
(cxml:parse-file "test.xml" (cxml-xmls:make-xmls-builder))
Function CXML-XMLS:MAP-NODE (handler node &key include-xmlns-attributes)
Traverse an XMLS document/node and call SAX functions as if an XML
representation of the document were processed by a SAX parser.
Use this function to serialize XMLS data. For example, we could
define a replacement for xmls:write-xml like this:
(defun write-xml (stream node &key indent)
  (let ((sink (cxml:make-character-stream-sink
               stream :canonical nil :indentation indent)))
    (cxml-xmls:map-node sink node)))
Function CXML-XMLS:MAKE-NODE (&key name ns attrs
children) => xmls node
Build a list node of the form
(name ((name value)*) child*).
The node list's car can also be a cons of local name
and namespace prefix ns.
fixme: It is unclear to me how namespaces are meant to
work in xmls, since xmls documentation differs from how xmls
actually works in current releases. Usually applications need to
know both the namespace prefix and the namespace URI. We
currently follow the xmls implementation and use the
namespace prefix instead of following its documentation which
shows the URI. We do not follow xmls in munging xmlns attribute
values. Attributes themselves have namespaces and it is not clear
to me how that works in xmls.
Accessor CXML-XMLS:NODE-NAME (node)
Accessor CXML-XMLS:NODE-NS (node)
Accessor CXML-XMLS:NODE-ATTRS (node)
Accessor CXML-XMLS:NODE-CHILDREN (node)
Accessors for xmls node data.
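As a small illustration of the accessors, the sketch below collects the names of a node's element children. It assumes, as in xmls, that text children appear as non-list objects among the children while element children are list nodes; the function name child-element-names is hypothetical.

```lisp
;; Hypothetical helper (not part of CXML): return the names of all
;; element children of NODE, skipping children that are not list nodes
;; (assumed here to be text content).
(defun child-element-names (node)
  (loop for child in (cxml-xmls:node-children node)
        when (consp child)
          collect (cxml-xmls:node-name child)))
```

It can be applied to the result of parsing with cxml-xmls:make-xmls-builder, such as the list structure shown in the quick-start example.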
Dealing with Rods
As explained above, the XML parser handles character encoding and
uses 16bit strings internally. Instead of using characters and strings
it uses runes and rods. This is seen as a
feature, but can be inconvenient.
-
If your Lisp supports 16 bit unicode strings, use feature
:rune-is-character and forget about runes and rods.
CXML will use ordinary Lisp characters and strings both
internally and externally.
-
If your Lisp does not support such strings and your application
needs Unicode support, use functions defined in the
runes package instead of ordinary string operators.
-
If your Lisp does not support such strings and your application
does not need Unicode support anyway, it will probably be more
convenient to let CXML convert rods into strings automatically.
To do that, use cxml:make-recoder to chain a special
sax handler between the parser and your application handler.
The recoder translates all rods using an application defined
function, which defaults to runes:rod-string. Although
the actual XML parser still uses rods internally, your SAX
handler will only see ordinary Lisp strings.
Note that the recoder approach does not work with the DOM
builder, since DOM is specified to use UTF-16.
Function CXML:MAKE-RECODER (chained-handler &optional recoder-fn)
Return a SAX handler which passes all events on to
chained-handler after converting all strings and rods
using recoder-fn, a function of one argument which
defaults to runes:rod-string.
Example. In a Lisp which ordinarily would use octet vector rods:
CL-USER(14): (cxml:parse-string "<test/>" (cxml-xmls:make-xmls-builder))
(#(116 101 115 116) NIL)
Use a SAX recoder to get strings instead:
CL-USER(17): (parse-string "<test/>" (cxml:make-recoder (cxml-xmls:make-xmls-builder)))
("test" NIL)
Caching of DTD Objects
To avoid spending time parsing the same DTD over and over again,
CXML can cache DTD objects. The parser consults
cxml:*dtd-cache* whenever it is looking for an external
subset in a document which does not have an internal subset and
uses the cached DTD instance if one is present in the cache for
the System ID in question.
Note that DTDs do not expire from the cache automatically.
(Future versions of CXML might introduce automatic checks for
outdated DTDs.)
Variable CXML:*DTD-CACHE*
The DTD cache object consulted by the parser when it needs a DTD.
Function CXML:MAKE-DTD-CACHE ()
Return a new, empty DTD cache object.
Variable CXML:*CACHE-ALL-DTDS*
If true, instructs the parser to enter all DTDs that could have
been cached into *dtd-cache* if they were not cached
already. Defaults to nil.
Reader CXML:GETDTD (uri dtd-cache)
Return a cached instance of the DTD at uri, if present in
the cache, or nil.
Writer CXML:GETDTD (uri dtd-cache)
Enter a new value for uri into dtd-cache.
Function CXML:REMDTD (uri dtd-cache)
Ensure that no DTD is recorded for uri in the cache and
return true if such a DTD was present.
Function CXML:CLEAR-DTD-CACHE (dtd-cache)
Remove all entries from dtd-cache.
fixme: thread-safety
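A sketch of how an application might use these variables and functions together: keep a private cache, let the parser fill it while parsing, and clear it by hand when DTD files change on disk. The names *my-dtd-cache*, parse-with-dtd-caching and forget-cached-dtds are hypothetical; only the cxml: symbols come from the descriptions above.

```lisp
;; Hypothetical application-level cache (not part of CXML).
(defparameter *my-dtd-cache* (cxml:make-dtd-cache))

;; Parse PATHNAME with the private cache installed and with the parser
;; instructed to enter every cacheable DTD it encounters.
(defun parse-with-dtd-caching (pathname)
  (let ((cxml:*dtd-cache* *my-dtd-cache*)
        (cxml:*cache-all-dtds* t))
    (cxml:parse-file pathname (dom:make-dom-builder))))

;; DTDs never expire on their own, so drop them explicitly when needed.
(defun forget-cached-dtds ()
  (cxml:clear-dtd-cache *my-dtd-cache*))
```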
XML Catalogs
External entities (for example, DTDs) are referred to using their
Public and System IDs. Usually the System ID, a URI, is used to
locate the entity. CXML itself handles only file://-URIs, but
many System IDs in practical use are http://-URIs. There are two
different mechanisms applications can use to allow CXML to locate
entities using arbitrary Public ID or System ID:
-
User-defined entity resolvers can be used to open entities using
arbitrary protocols. For example, an entity resolver could
handle all System-IDs with the http scheme using some
HTTP library. Refer to the description of the
entity-resolver keyword argument to parser functions (see cxml:parse-file) for more
information on entity resolvers.
-
XML Catalogs are (local) tables in XML syntax which map External
IDs to alternative System IDs. If, say, the xhtml DTD is
present in the local file system and the local copy has been
registered with the XML catalog, CXML will use the local copy of
the DTD instead of trying to open the version available using HTTP.
This section describes XML Catalogs, the second solution. CXML
implements Oasis
XML Catalogs.
Variable CXML:*CATALOG*
The XML Catalog object consulted by the parser before trying to
open an entity. Initially nil.
Variable CXML:*PREFER*
The default "prefer" mode from the Catalog specification, one
of :public or :system. Defaults
to :public.
Function CXML:MAKE-CATALOG (&optional uris)
Return a catalog object for the catalog files specified.
Function CXML:RESOLVE-URI (uri catalog)
Look up uri in catalog and return the
resulting URI, or nil if no match was found.
Function CXML:RESOLVE-EXTID (publicid systemid catalog)
Look up the External ID (publicid, systemid)
in catalog and return the resulting URI, or nil
if no match was found.
Example:
* (setf cxml:*catalog* nil)
* (cxml:parse-file "test.xhtml" nil)
=> Error: URI scheme :HTTP not supported
* (setf cxml:*catalog* (cxml:make-catalog))
* (cxml:parse-file "test.xhtml" nil)
;; no error!
NIL
Note that parsed catalog files are cached in the catalog object.
Catalog files cached do not expire automatically. To ensure that
all catalog files are parsed again, create a new catalog object.
SAX Interface
A SAX handler is an arbitrary object that implements some of the
generic functions in the SAX package. Note that no default
handler class is necessary, because all generic functions have default
methods which do nothing. SAX functions are:
Function SAX:START-DOCUMENT (handler)
Function SAX:END-DOCUMENT (handler)
Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:CHARACTERS (handler data)
Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)
Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)
Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)
Function SAX:NOTATION-DECLARATION (handler name public-id system-id)
Function SAX:ELEMENT-DECLARATION (handler name model)
Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)
Accessor SAX:ATTRIBUTE-PREFIX (attribute)
Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)
Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)
Accessor SAX:ATTRIBUTE-VALUE (attribute)
Accessor SAX:ATTRIBUTE-QNAME (attribute)
Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)
The entity declaration methods are similar to Java SAX
definitions, but parameter entities are distinguished from
general entities not by a % prefix to the name, but by
the kind argument, either :parameter or
:general.
The arguments to sax:element-declaration and
sax:attribute-declaration differ significantly from their
Java counterparts.
fixme: For more information on these functions refer to the docstrings.
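Since any object can act as a handler, a minimal user-defined handler only needs methods for the events it cares about. The sketch below counts start tags; the class element-counter, its accessor and the function count-elements are hypothetical names, and cxml is assumed to be loaded.

```lisp
;; Hypothetical handler class (not part of CXML). It needs no special
;; superclass, because the SAX generic functions have do-nothing defaults.
(defclass element-counter ()
  ((total :initform 0 :accessor element-total))
  (:documentation "Counts the start-element events it receives."))

(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-total handler)))

;; Parse PATHNAME with a fresh counter and read the count back from the
;; handler object rather than relying on the parser's return value.
(defun count-elements (pathname)
  (let ((handler (make-instance 'element-counter)))
    (cxml:parse-file pathname handler)
    (element-total handler)))
```

All other events, such as sax:characters or sax:end-element, fall through to the default methods and are ignored.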
DOM Notes
CXML implements the DOM Level 1 Core interfaces. Explaining
DOM is better left to the specification,
so please refer to the official W3C documents for DOM.
However, there is no "standard" DOM mapping for Lisp. DOM
is specified
in CORBA IDL, but it refrains from using object-oriented IDL
features, allowing for a much more natural Lisp implementation than
the ordinary IDL/Lisp mapping would.
Differences between CXML's DOM and the direct IDL/Lisp mapping:
-
DOM function names are symbols in the DOM package (not
the OP package).
-
DOM functions have proper required arguments, not a huge
&rest lambda list.
-
Although most IDL interfaces are implemented as CLOS classes by
CXML, the Lisp types of DOM objects are not documented and cannot
be relied upon. A node's type can be determined using
dom:node-type instead.
-
DOMString is mapped to rod, which is either
an (unsigned-byte 16) array type or a string type.
-
The IDL/Lisp mapping maps CORBA enums to Lisp keywords.
Unfortunately, the DOM IDL does not use enums. Instead,
both exception types and node types are defined as integer
constants. CXML chooses to ignore this definition and uses
keywords instead.
-
DOM uses StudlyCaps. Lisp programmers don't. We
insert #\- before every upper case letter preceded by a
lower case letter and before every upper case letter which is
followed by a lower case letter, but preceded by a capital
letter. This algorithm leads to the natural Lisp spelling
of DOM function names.
-
Implementation note: DOM's NodeList does not
necessarily map to a native "sequence" type. (For example,
node lists are objects in Java, not arrays.)
NodeList is specified to reflect changes done after a
node list was created, so node lists cannot be Lisp lists.
(A node list could be implemented as a CLOS object pointing to
said list though.) Instead, CXML currently implements node
lists as adjustable vectors. Note that code which relies on
this implementation and uses Lisp sequence functions
instead of sticking to dom:item and dom:length
is not portable. As a compromise, you can use our
extensions dom:map-node-list or
dom:do-node-list, which can be implemented portably.
Example:
XML(97): (dom:node-type
          (dom:document-element
           (cxml:parse-file "~/test.xml" (dom:make-dom-builder))))
:ELEMENT
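The StudlyCaps rule described in the list above can be written down in a few lines of portable Common Lisp. This is only an illustration of the stated rule, not code taken from CXML; the function name lispify-dom-name is hypothetical, and upcasing the result is an assumption made so that the output looks like a printed symbol name.

```lisp
;; Hypothetical helper (not part of CXML): insert #\- before an upper
;; case letter that is either preceded by a lower case letter, or
;; preceded by an upper case letter and followed by a lower case letter.
(defun lispify-dom-name (name)
  "Return the StudlyCaps string NAME in hyphenated, upper-case form."
  (with-output-to-string (out)
    (loop for i from 0 below (length name)
          for c = (char name i)
          for prev = (and (> i 0) (char name (1- i)))
          for next = (and (< (1+ i) (length name)) (char name (1+ i)))
          do (when (and prev
                        (upper-case-p c)
                        (or (lower-case-p prev)
                            (and (upper-case-p prev)
                                 next
                                 (lower-case-p next))))
               (write-char #\- out))
             (write-char (char-upcase c) out))))

(lispify-dom-name "getElementsByTagName") ; => "GET-ELEMENTS-BY-TAG-NAME"
(lispify-dom-name "createCDATASection")   ; => "CREATE-CDATA-SECTION"
```

The second call shows why the rule has two parts: a run of capitals such as CDATA stays together, and the hyphen is placed only before the capital that starts the next word.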