Closure XML Parser
An XML parser written in Common Lisp.
Closure XML was written by Gilbert Baumann
(unk6 at rz.uni-karlsruhe.de) as part of the Closure web
browser.
Contributions to the parser by
- Henrik Motakef (hmot at henrik-motakef.de)
  (SAX layer; namespace support)
- David Lichteblau at knowledgeTools (david at knowledgetools.de)
  (conversion into an independent package; DOM bug fixing; validation)
The cxml-devel mailing list is hosted on common-lisp.net.
Download
Get a tarball.
There is no CVS repository (yet).
You can check out David's tla archive at http://www.common-lisp.net/project/cxml/david@knowledgetools.de--cxml/.
(Brief tla usage instructions: Unpack the cxml tarball.
Enter tla register-archive URL to turn it into a working
copy. tla update is similar to cvs up.)
Recent Development
patch-279
- Validation
- bugfixes; XHTML DTD parses again; corrected SAX entity handling
patch-204
- Renamed package XML to CXML.
- The unparse functions support non-canonical output now.
Contents
CXML Modules
CXML provides three packages:
- RUNES, a portable implementation of Unicode strings.
- CXML, a namespace-aware validating SAX parser
  implementing the XML 1.0 specification.
- DOM, an implementation of the DOM Level 1 Core interfaces.
Installation
CXML is written in Common Lisp and should be portable to all
Common Lisp implementations. Currently assumed to work are
ACL, SBCL, CMUCL, and CLISP, though development is done on
ACL. (CLISP needs some -E option teaching it to
accept non-ASCII source files.)
ASDF is used for
compilation. The following instructions assume that ASDF has
already been loaded.
Configuration (optional).
CXML has full Unicode support -- even on Lisps without
Unicode strings. On Lisps that are not Unicode-aware, DOMString
is implemented as an array of character codes. If your Lisp
supports 16 bit characters natively, you can enable feature
RUNE-IS-CHARACTER to select an alternative
DOMString implementation, which uses real characters
instead of character codes.
* (pushnew :rune-is-character *features*)
Compiling and loading CXML.
Register the .asd file, e.g. by symlinking it:
$ ln -sf `pwd`/cxml.asd /path/to/your/registry
Then compile CXML using:
* (asdf:operate 'asdf:load-op :cxml)
Tests
Check out the XML and DOM testsuites:
$ export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
$ cvs login # password is "anonymous"
$ cvs co 2001/XML-Test-Suite/xmlconf
$ cvs co 2001/DOM-Test-Suite
Usage and expected output:
* (xmlconf:run-all-tests "/path/to/2001/XML-Test-Suite/xmlconf/")
0/556 tests failed; 1606 tests were skipped
* (domtest:run-all-tests "/path/to/2001/DOM-Test-Suite/")
0/450 tests failed; 71 tests were skipped
fixme: Add an explanation of xml/sax-tests here.
fixme: My parser does not understand the current testsuite
anymore. To fix this problem, revert the affected files
manually after check-out:
$ cd 2001/XML-Test-Suite/xmlconf/
xmlconf$ patch -p0 -R </path/to/cxml/test/xmlconf-base.diff
The log message for the changes reads "Removed unnecessary
xml:base attribute". If I understand correctly, only
DOM 3 parsers provide the baseURI attribute necessary for
understanding xmlconf.xml now. We don't have that
yet.
To do
- David's changes might have affected performance. Some
  benchmarking needs to be done here.
- DOM in general is pretty heavyweight. There is/was a
  "simple-dom" which should be faster. This might be worth
  reviving.
- For those who don't like DOM at all, it would be a very simple
  exercise to write a SAX handler for "Lisp-XML" output instead of
  DOM. Other ideas include Erik Naggum's quads.
- The serializer supports only Canonical XML right now. In the
  future we want support for including doctype declarations in the
  output, ordinary output with less character reference noise (done),
  optional indentation (done), user-specified encoding, etc.
- There are still thread-safety issues.
- Validation! (done)
- Upgrade to DOM Level 2 for complete namespace support.
- Unless rune-is-character is enabled, rod hashing
  currently uses equalp hash tables, which is idiotic.
  (See %make-rod-hash-table.)
(Compare also with Gilbert Baumann's older TODO list in
xml-parse.lisp.)
Using the parser
Function CXML:PARSE-FILE (pathname handler &key validate root)
Function CXML:PARSE-STREAM (stream handler &key validate root)
Function CXML:PARSE-OCTETS (octets handler &key validate root)
Parse an XML document.
Return values from this function depend on the SAX handler used.
Arguments:
- pathname -- a Common Lisp pathname
- stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
- octets -- an (unsigned-byte 8) array
- handler -- a SAX handler
Common keyword arguments:
- validate -- t, nil, or a DTD
instance. Defaults to nil.
- root -- nil or the expected root element
name as a rod.
Arguments to validate:
- nil -- do not validate
- t -- assert that the document contains a DOCTYPE
declaration and conforms to the DTD declared.
- a DTD instance -- assert that the document conforms to the DTD
passed as an argument (as opposed to the DOCTYPE declaration, if
any).
When validating, argument root can be used to override the
name given in the DOCTYPE declaration. This is useful
together with a caller-specified DTD instance.
Function CXML:PARSE-DTD-FILE (pathname)
Function CXML:PARSE-DTD-STREAM (stream)
Parse declarations
from a stand-alone file and return an object representing the DTD,
suitable as an argument to validate.
- pathname -- a Common Lisp pathname
- stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
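The two validation modes can be sketched as follows. This is only an illustration built from the signatures listed above: the file names "test.xml" and "test.dtd" and the root element name "foo" are made up, and the #"..." rod syntax is assumed to be available as in the make-validator example elsewhere in this document.

```lisp
;; Sketch only; file names and the root name are hypothetical.

;; Validate against the DTD named in the document's own DOCTYPE
;; declaration. The document must contain such a declaration.
(cxml:parse-file "test.xml" (dom:make-dom-builder) :validate t)

;; Validate against a separately parsed DTD instead, and state the
;; expected root element name explicitly.
(let ((dtd (cxml:parse-dtd-file "test.dtd")))
  (cxml:parse-file "test.xml" (dom:make-dom-builder)
                   :validate dtd
                   :root #"foo"))
```

Parsing the DTD once and passing the instance to several parse calls avoids depending on whatever DOCTYPE declaration, if any, each document carries.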
Function DOM:MAKE-DOM-BUILDER ()
Create a SAX handler which builds a DOM document. Example:
(cxml:parse-file "test.xml" (dom:make-dom-builder))
Function CXML:UNPARSE-DOCUMENT (document stream &rest keys)
Function CXML:UNPARSE-DOCUMENT-TO-OCTETS (document &rest keys) => vector
Serialize a DOM document object.
- document -- a DOM document object
- stream -- a Common Lisp stream with element-type
character
Keyword arguments:
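- canonical -- canonical form, one of NIL, T, 1, 2
- indentation -- indentation level. An integer or nil.
The following canonical values are allowed:
- t or 1 -- Canonical XML
- 2 -- Second Canonical Form
- nil -- use a more readable non-canonical representation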
With an indentation level, pretty-print the XML by
inserting additional whitespace. Note that indentation
changes the document model and should only be used if whitespace
does not matter to the application.
unparse-document-to-octets returns an (unsigned-byte
8) array, whereas unparse-document writes
characters. unparse-document is useful together
with with-output-to-string. However, note that the
resulting document in both cases is UTF-8 encoded, so the
characters written by unparse-document are really UTF-8
bytes encoded as characters.
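A minimal sketch of both serialization entry points, using only the signatures and keywords listed above. The function name serialize-both-ways is hypothetical, and passing :canonical nil together with :indentation is an assumption about a sensible combination, not documented behaviour.

```lisp
;; Sketch only: SERIALIZE-BOTH-WAYS is a hypothetical helper.
(defun serialize-both-ways (document)
  "Return DOCUMENT serialized as a string and as an octet vector."
  (values
   ;; UNPARSE-DOCUMENT writes characters; each character carries one
   ;; UTF-8 byte, so the string is not decoded text.
   (with-output-to-string (stream)
     (cxml:unparse-document document stream
                            :canonical nil
                            :indentation 2))
   ;; UNPARSE-DOCUMENT-TO-OCTETS returns an (unsigned-byte 8) array.
   (cxml:unparse-document-to-octets document :canonical nil)))
```

The octet form is the safer choice when the result is written to a binary stream or sent over the network, since no character decoding is implied.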
Function CXML:MAKE-VALIDATOR (dtd root)
Create a SAX handler which validates against a DTD instance.
The document's root element must be named root.
Used with dom:map-document, this validates a document
object as if by re-reading it with a validating parser, except
that declarations recorded in the document instance are completely
ignored.
Example:
(let ((d (cxml:parse-file "~/test.xml" (dom:make-dom-builder)))
      (x (cxml:parse-dtd-file "~/test.dtd")))
  (dom:map-document (cxml:make-validator x #"foo") d))
Function DOM:MAP-DOCUMENT (handler document &key include-xmlns-attributes include-default-values)
Traverse a DOM document and call SAX functions as if an XML
representation of the document were processed by a SAX parser.
SAX interface
A SAX handler is an arbitrary object that implements some of the
generic functions in the SAX package. Note that no default
handler class is necessary, because all generic functions have default
methods which do nothing. SAX functions are:
Function SAX:START-DOCUMENT (handler)
Function SAX:END-DOCUMENT (handler)
Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:CHARACTERS (handler data)
Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
Function SAX:UNPARSED-ENTITY-DECLARATION (handler name public-id system-id notation-name)
Function SAX:EXTERNAL-ENTITY-DECLARATION (handler kind name public-id system-id)
Function SAX:INTERNAL-ENTITY-DECLARATION (handler kind name value)
Function SAX:NOTATION-DECLARATION (handler name public-id system-id)
Function SAX:ELEMENT-DECLARATION (handler name model)
Function SAX:ATTRIBUTE-DECLARATION (handler ename aname type default)
Accessor SAX:ATTRIBUTE-PREFIX (attribute)
Accessor SAX:ATTRIBUTE-NAMESPACE-URI (attribute)
Accessor SAX:ATTRIBUTE-LOCAL-NAME (attribute)
Accessor SAX:ATTRIBUTE-VALUE (attribute)
Accessor SAX:ATTRIBUTE-QNAME (attribute)
Accessor SAX:ATTRIBUTE-SPECIFIED-P (attribute)
The entity declaration methods are similar to Java SAX
definitions, but parameter entities are distinguished from
general entities not by a % prefix to the name, but by
the kind argument, either :parameter or
:general.
The arguments to sax:element-declaration and
sax:attribute-declaration differ significantly from their
Java counterparts.
fixme: For more information on these functions refer to the docstrings.
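As an illustration of the handler protocol, here is a minimal sketch of a handler that counts elements. The class element-counter, its accessor element-count, and the function count-elements are hypothetical names. Only sax:start-element is specialized, relying on the statement above that all other SAX generic functions have do-nothing default methods.

```lisp
;; Sketch only: ELEMENT-COUNTER, ELEMENT-COUNT and COUNT-ELEMENTS
;; are hypothetical names, not part of CXML.
(defclass element-counter ()
  ((n-elements :initform 0 :accessor element-count)))

;; Called once per start tag; the arguments are not needed here.
(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-count handler)))

(defun count-elements (pathname)
  "Parse the file at PATHNAME and return the number of elements seen."
  (let ((handler (make-instance 'element-counter)))
    (cxml:parse-file pathname handler)
    ;; Read the result from the handler object rather than relying on
    ;; the return value of PARSE-FILE.
    (element-count handler)))
```

Keeping the state in the handler object means the same class works unchanged with cxml:parse-stream, cxml:parse-octets, or dom:map-document.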
DOM Notes
CXML implements the DOM Level 1 Core interfaces. Explaining
DOM is better left to the specification,
so please refer to the official W3C documents for DOM.
However, there is no "standard" DOM mapping for Lisp. DOM
is specified
in CORBA IDL, but it refrains from using object-oriented IDL
features, allowing for a much more natural Lisp implementation than
the ordinary IDL/Lisp mapping would.
Differences between CXML's DOM and the direct IDL/Lisp mapping:
- DOM function names are symbols in the DOM package (not
  the OP package).
- DOM functions have proper required arguments, not a huge
  &rest lambda list.
- Although most IDL interfaces are implemented as CLOS classes by
  CXML, the Lisp types of DOM objects are not documented and cannot
  be relied upon. A node's type can be determined using
  dom:node-type instead.
- DOMString is mapped to rod, which is either
  an (unsigned-byte 16) array type or a string type.
- The IDL/Lisp mapping maps CORBA enums to Lisp keywords.
  Unfortunately, the DOM IDL does not use enums. Instead,
  both exception types and node types are defined as integer
  constants. CXML chooses to ignore this definition and uses
  keywords instead.
- DOM uses StudlyCaps. Lisp programmers don't. We
  insert #\- before every upper case letter preceded by a
  lower case letter and before every upper case letter which is
  followed by a lower case letter, but preceded by a capital
  letter. This algorithm leads to the natural Lisp spelling
  of DOM function names.
- Implementation note: DOM's NodeList does not
  necessarily map to a native "sequence" type. (For example,
  node lists are objects in Java, not arrays.)
  NodeList is specified to reflect changes done after a
  node list was created, so node lists cannot be Lisp lists.
  (A node list could be implemented as a CLOS object pointing to
  said list though.) Instead, CXML currently implements node
  lists as adjustable vectors. Note that code which relies on
  this implementation and uses Lisp sequence functions
  instead of sticking to dom:item and dom:length
  is not portable. As a compromise, you can use our
  extensions dom:map-node-list or
  dom:do-node-list, which can be implemented portably.
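The StudlyCaps naming rule from the list above can be sketched in plain Common Lisp. The function studly-caps-to-lisp-name is a hypothetical helper written for illustration and is not part of CXML. It upcases the result to match how the Lisp reader interns symbols by default.

```lisp
;; Sketch only: STUDLY-CAPS-TO-LISP-NAME is a hypothetical helper.
(defun studly-caps-to-lisp-name (name)
  "Insert hyphens into the StudlyCaps string NAME and upcase it."
  (with-output-to-string (out)
    (loop for i from 0 below (length name)
          for c = (char name i)
          for prev = (and (> i 0) (char name (1- i)))
          for next = (and (< (1+ i) (length name)) (char name (1+ i)))
          do (when (and (upper-case-p c)
                        prev
                        (or
                         ;; upper case letter preceded by a lower case one
                         (lower-case-p prev)
                         ;; upper case letter preceded by a capital and
                         ;; followed by a lower case letter
                         (and (upper-case-p prev)
                              next
                              (lower-case-p next))))
               (write-char #\- out))
             (write-char (char-upcase c) out))))

(studly-caps-to-lisp-name "nodeType")             ; => "NODE-TYPE"
(studly-caps-to-lisp-name "getElementsByTagName") ; => "GET-ELEMENTS-BY-TAG-NAME"
(studly-caps-to-lisp-name "createCDATASection")   ; => "CREATE-CDATA-SECTION"
```

The second clause is what keeps a run of capitals such as CDATA together while still splitting it from the word that follows.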
Example:
XML(97): (dom:node-type
(dom:document-element
(cxml:parse-file "~/test.xml" (dom:make-dom-builder))))
:ELEMENT
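Portable node list access, as recommended in the implementation note above, might look like the following sketch. It uses only dom:length and dom:item on the node list. The names dom:child-nodes and dom:node-name are assumed to be the Lisp spellings of the DOM Level 1 attributes childNodes and nodeName under the naming rule, and child-node-names is a hypothetical helper.

```lisp
;; Sketch only: CHILD-NODE-NAMES is a hypothetical helper.
(defun child-node-names (node)
  "Return a list with the node name of each child of NODE."
  (let ((children (dom:child-nodes node)))
    ;; Index through the node list with DOM:ITEM instead of treating
    ;; it as a Lisp vector, so the code does not depend on the
    ;; adjustable-vector representation.
    (loop for i from 0 below (dom:length children)
          collect (dom:node-name (dom:item children i)))))
```

The returned names are rods, so whether they print as strings depends on whether rune-is-character is enabled.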