Closure XML Parser
An XML parser written in Common Lisp.
Closure XML was written by Gilbert Baumann
(unk6 at rz.uni-karlsruhe.de) as part of the Closure web
browser.
Contributions to the parser by
-
Henrik Motakef (hmot at henrik-motakef.de)
(SAX layer; namespace support)
-
David Lichteblau at knowledgeTools (david at knowledgetools.de)
(conversion into an independent package; DOM bug fixing)
Mailing list cxml-devel
is hosted on common-lisp.net.
Download.
There is no CVS repository (yet).
You can check out David's tla archive at http://www.common-lisp.net/project/cxml/david@knowledgetools.de--cxml/.
There will also be tarballs.
Contents
CXML Modules
CXML provides three packages:
-
RUNES, a portable implementation of Unicode strings.
-
XML, a namespace-aware SAX parser implementing the XML 1.0
specification.
-
DOM, an implementation of the DOM
Level 1 Core interfaces.
Installation
CXML is written in Common Lisp and should be portable to all
Common Lisp implementations. Currently known to work are
ACL, SBCL, CMUCL, and CLISP. (CLISP needs some -E option
teaching it to accept non-ASCII source files.)
ASDF is used for
compilation. The following instructions assume that ASDF has
already been loaded.
Configuration (optional).
CXML has full Unicode support -- even on Lisps without
Unicode strings. On Lisps that are not Unicode-aware, DOMString
is implemented as an array of character codes. If your Lisp
supports 16-bit characters natively, you can enable the feature
RUNE-IS-CHARACTER to select an alternative
DOMString implementation, which uses real characters
instead of character codes.
* (pushnew :rune-is-character *features*)
Compiling and loading CXML.
Register the .asd file, e.g. by symlinking it:
$ ln -sf `pwd`/cxml.asd /path/to/your/registry
Then compile CXML using:
* (asdf:operate 'asdf:load-op :cxml)
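If symlinking is inconvenient, ASDF can also be pointed at the source directory directly. The following is a minimal sketch, assuming the CXML sources live in the placeholder directory /path/to/cxml/ (the trailing slash matters):

```lisp
;; Assumption: /path/to/cxml/ is where cxml.asd was unpacked.
;; ASDF searches every directory listed in asdf:*central-registry*.
(push #p"/path/to/cxml/" asdf:*central-registry*)

;; Compile (if necessary) and load the system.
(asdf:operate 'asdf:load-op :cxml)
```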
Tests
Check out the XML and DOM testsuites:
$ export CVSROOT=:pserver:anonymous@dev.w3.org:/sources/public
$ cvs login # password is "anonymous"
$ cvs co 2001/XML-Test-Suite/xmlconf
$ cvs co 2001/DOM-Test-Suite
Usage and expected output:
* (xmlconf:run-all-tests "/path/to/2001/XML-Test-Suite/xmlconf/")
22/389 tests failed; 1773 tests were skipped
* (domtest:run-all-tests "/path/to/2001/DOM-Test-Suite/")
0/440 tests failed; 81 tests were skipped
Most XML testsuite failures are due to document type declarations
which are read by CXML, but not written when the document is
serialized again. This needs work.
fixme: Add an explanation of xml/sax-tests here.
fixme: My parser does not understand the current testsuite
anymore. To fix this problem, revert the affected files
manually after check-out:
$ cd 2001/XML-Test-Suite/xmlconf/
xmltest$ patch -p0 -R </path/to/cxml/test/xmlconf-base.diff
The log message for the changes reads "Removed unnecessary
xml:base attribute". If I understand correctly, only
DOM 3 parsers provide the baseURI attribute necessary for
understanding xmlconf.xml now. We don't have that
yet.
To do
-
David's changes might have affected performance. Some
benchmarking needs to be done here.
-
DOM in general is pretty heavyweight. There is/was a
"simple-dom" which should be faster. This might be worth
reviving.
-
For those who don't like DOM at all, it would be a very simple
exercise to write a SAX handler for "Lisp-XML" output instead of
DOM. Other ideas include Erik Naggum's quads.
-
The serializer supports only Canonical
XML right now. In the future we want support for
doctype declarations in the output, ordinary output
with less entity noise, optional indentation, user-specified
encoding, etc.
-
There are still thread-safety issues.
-
Validation!
Using the parser
Function XML:PARSE-FILE (pathname handler)
Function XML:PARSE-STREAM (stream handler)
Function XML:PARSE-OCTETS (octets handler)
Parse an XML document. Arguments:
- pathname -- a Common Lisp pathname
- stream -- a Common Lisp stream with element-type
(unsigned-byte 8)
- octets -- an (unsigned-byte 8) array
- handler -- a SAX handler
Return values from this function depend on the SAX handler used.
Function DOM:MAKE-DOM-BUILDER ()
Create a SAX handler which builds a DOM document. Example:
(xml:parse-file "test.xml" (dom:make-dom-builder))
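The stream and octet entry points can be used in the same way. This is only a sketch: "test.xml" is a placeholder file name, and the DOM builder is used as the handler purely for illustration.

```lisp
;; Parse from a binary stream. The stream must have element-type
;; (unsigned-byte 8), as described in the argument list above.
(with-open-file (stream "test.xml" :element-type '(unsigned-byte 8))
  (xml:parse-stream stream (dom:make-dom-builder)))

;; Parse from an octet vector by reading the whole file into memory first.
(with-open-file (stream "test.xml" :element-type '(unsigned-byte 8))
  (let ((octets (make-array (file-length stream)
                            :element-type '(unsigned-byte 8))))
    (read-sequence octets stream)
    (xml:parse-octets octets (dom:make-dom-builder))))
```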
Function XML:UNPARSE-DOCUMENT (document stream)
Function XML:UNPARSE-DOCUMENT-TO-OCTETS (document) => vector
Serialize a document into canonical
form.
- document -- a DOM document object
- stream -- a Common Lisp stream with element-type
character
unparse-document-to-octets returns an (unsigned-byte
8) array, whereas unparse-document writes
characters. unparse-document is useful together
with with-output-to-string. However, note that the
resulting document in both cases is UTF-8 encoded, so the
characters written by unparse-document are really UTF-8
bytes encoded as characters.
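A parse/serialize round trip might look as follows. This is a sketch using only the functions documented above; "test.xml" is a placeholder file name.

```lisp
(let ((document (xml:parse-file "test.xml" (dom:make-dom-builder))))
  (values
   ;; Canonical form as a Lisp string. Each character stands for one
   ;; UTF-8 byte, so non-ASCII text will not look like the original.
   (with-output-to-string (stream)
     (xml:unparse-document document stream))
   ;; The same canonical form as an (unsigned-byte 8) array.
   (xml:unparse-document-to-octets document)))
```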
SAX interface
A SAX handler is an arbitrary object that implements some of the
generic functions in the SAX package. Note that no default
handler class is necessary, because all generic functions have default
methods which do nothing. SAX functions are:
Function SAX:START-DOCUMENT (handler)
Function SAX:START-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:START-PREFIX-MAPPING (handler prefix uri)
Function SAX:CHARACTERS (handler data)
Function SAX:PROCESSING-INSTRUCTION (handler target data)
Function SAX:END-PREFIX-MAPPING (handler prefix)
Function SAX:END-ELEMENT (handler namespace-uri local-name qname attributes)
Function SAX:END-DOCUMENT (handler)
Function SAX:COMMENT (handler data)
Function SAX:START-CDATA (handler)
Function SAX:END-CDATA (handler)
Function SAX:START-DTD (handler name public-id system-id)
Function SAX:END-DTD (handler)
fixme: For information on these functions refer to the docstrings.
fixme: Entity and notation processing isn't quite right yet.
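As an illustration of the handler protocol, here is a minimal sketch of a handler that counts elements. The class ELEMENT-COUNTER and its accessor are hypothetical names invented for this example; only SAX:START-ELEMENT and XML:PARSE-FILE come from CXML, with the argument lists shown above.

```lisp
;; Hypothetical handler class; any object will do because all SAX
;; generic functions have default methods which do nothing.
(defclass element-counter ()
  ((elements :initform 0 :accessor element-counter-elements)))

;; Called once per start tag; all other events fall through to the
;; default methods.
(defmethod sax:start-element ((handler element-counter)
                              namespace-uri local-name qname attributes)
  (declare (ignore namespace-uri local-name qname attributes))
  (incf (element-counter-elements handler)))

;; Usage: parse a file, then read the count back from the handler.
(let ((handler (make-instance 'element-counter)))
  (xml:parse-file "test.xml" handler)
  (element-counter-elements handler))
```

Reading the result from the handler object avoids depending on what the parse function itself returns.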
DOM Notes
CXML implements the DOM Level 1 Core interfaces. Explaining
DOM is better left to the specification,
so please refer to the official W3C documents for DOM.
However, there is no "standard" DOM mapping for Lisp. DOM
is specified
in CORBA IDL, but it refrains from using object-oriented IDL
features, allowing for a much more natural Lisp implementation than
the ordinary IDL/Lisp mapping would.
Differences between CXML's DOM and the direct IDL/Lisp mapping:
-
DOM function names are symbols in the DOM package (not
the OP package).
-
DOM functions have proper required arguments, not a huge
&rest lambda list.
-
Although most IDL interfaces are implemented as CLOS classes by
CXML, the Lisp types of DOM objects are not documented and cannot
be relied upon. A node's type can be determined using
dom:node-type instead.
-
DOMString is mapped to rod, which is either
an (unsigned-byte 16) array type or a string type.
-
The IDL/Lisp mapping maps CORBA enums to Lisp keywords.
Unfortunately, the DOM IDL does not use enums. Instead,
both exception types and node types are defined as integer
constants. CXML chooses to ignore this definition and uses
keywords instead.
-
DOM uses StudlyCaps. Lisp programmers don't. We
insert #\- before every upper case letter preceded by a
lower case letter and before every upper case letter which is
followed by a lower case letter, but preceded by a capital
letter. This algorithm leads to the natural Lisp spelling
of DOM function names.
-
Implementation note: DOM's NodeList does not
necessarily map to a native "sequence" type. (For example,
node lists are objects in Java, not arrays.)
NodeList is specified to reflect changes done after a
node list was created, so node lists cannot be Lisp lists.
(A node list could be implemented as a CLOS object pointing to
said list though.) Instead, CXML currently implements node
lists as adjustable vectors. Note that code which relies on
this implementation and uses Lisp sequence functions
instead of sticking to dom:item and dom:length
is not portable. As a compromise, you can use our
extensions dom:map-node-list or
dom:do-node-list, which can be implemented portably.
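The StudlyCaps renaming rule from the list above can be written down as a short function. This is a sketch for illustration only; STUDLY-TO-LISP is a hypothetical name and not part of CXML, and the result is upcased to match the default reader case.

```lisp
(defun studly-to-lisp (name)
  "Return the Lisp spelling of the StudlyCaps string NAME."
  (with-output-to-string (out)
    (loop for i from 0 below (length name)
          for c = (char name i)
          for prev = (and (> i 0) (char name (1- i)))
          for next = (and (< (1+ i) (length name)) (char name (1+ i)))
          do (when (and (upper-case-p c)
                        prev
                        ;; Rule 1: upper case preceded by lower case.
                        (or (lower-case-p prev)
                            ;; Rule 2: upper case preceded by a capital
                            ;; and followed by lower case.
                            (and (upper-case-p prev)
                                 next
                                 (lower-case-p next))))
               (write-char #\- out))
             (write-char (char-upcase c) out))))

(studly-to-lisp "getElementsByTagName") ; => "GET-ELEMENTS-BY-TAG-NAME"
(studly-to-lisp "createCDATASection")   ; => "CREATE-CDATA-SECTION"
```

The second rule is what keeps acronyms such as CDATA together while still separating them from the word that follows.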
Example:
XML(97): (dom:node-type
(dom:document-element
(xml:parse-file "~/test.xml" (dom:make-dom-builder))))
:ELEMENT
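Building on this example, node lists can be walked portably with dom:length and dom:item rather than with Lisp sequence functions. The sketch below assumes that the DOM attribute childNodes is available as dom:child-nodes, following the naming rule described above; CHILD-NODE-TYPES is a hypothetical helper name.

```lisp
(defun child-node-types (node)
  "Return a list with the node type keyword of each child of NODE."
  (let ((children (dom:child-nodes node)))
    ;; Index through the node list instead of treating it as a vector.
    (loop for i from 0 below (dom:length children)
          collect (dom:node-type (dom:item children i)))))

;; Usage: list the node types of the children of the document element.
(child-node-types
 (dom:document-element
  (xml:parse-file "~/test.xml" (dom:make-dom-builder))))
```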