S-XML

S-XML is a simple XML parser implemented in Common Lisp. It was originally written by Sven Van Caekenberghe and is now maintained by Sven Van Caekenberghe, Rudi Schlatte and Brian Mastenbrook. S-XML is used by S-XML-RPC and CL-PREVALENCE.

This XML parser implementation has the following features:

This XML parser implementation has the following limitations:

Download

You can download the LLGPL source code and documentation as s-xml.tgz, then build and/or install it with ASDF. The signature is s-xml.tgz.asc; the public key can be found in the common-lisp.net keyring.

You can view the CVS Repository or get anonymous CVS access as follows:

$ cvs -d:pserver:anonymous@common-lisp.net:/project/s-xml/cvsroot login
(Logging in to anonymous@common-lisp.net)
CVS password: anonymous
$ cvs -d:pserver:anonymous@common-lisp.net:/project/s-xml/cvsroot co s-xml
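
Once the sources are unpacked or checked out, loading the system from a Lisp session might look like the following sketch. The directory is a hypothetical location (adjust it to wherever s-xml.asd ended up), and asdf:load-system assumes a reasonably recent ASDF.

```lisp
;; Sketch only: the directory below is a hypothetical location, not a path
;; prescribed by S-XML. Point it at the directory that contains s-xml.asd.
(require :asdf)

(defparameter *s-xml-directory*
  (merge-pathnames "lisp/s-xml/" (user-homedir-pathname)))

;; Tell ASDF where to look, then compile and load the system.
(pushnew *s-xml-directory* asdf:*central-registry* :test #'equal)
(asdf:load-system :s-xml)
```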

API

The plain API exported by the package S-XML (automatically generated by LispDoc) is available in S-XML.html.

XML Parser

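Using a DOM parser is easier, but usually less efficient: see the next sections. To use the event-based API of the parser, you call the function start-parse-xml on a stream, together with an xml-parser-state that specifies 3 hook functions: new-element-hook, called with (name attributes seed) when an element starts; finish-element-hook, called with (name attributes parent-seed seed) when an element ends; and text-hook, called with (string seed) for text content. Each hook returns the new seed.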

As an example, consider the following tracer that shows how the different hooks are called:

(defun trace-xml-new-element-hook (name attributes seed)
  (let ((new-seed (cons (1+ (car seed)) (1+ (cdr seed)))))
    (trace-xml-log (car seed) 
                   "(new-element :name ~s :attributes ~:[()~;~:*~s~] :seed ~s) => ~s" 
                   name attributes seed new-seed)
    new-seed))

(defun trace-xml-finish-element-hook (name attributes parent-seed seed)
  (let ((new-seed (cons (1- (car seed)) (1+ (cdr seed)))))
    (trace-xml-log (car parent-seed)
                   "(finish-element :name ~s :attributes ~:[()~;~:*~s~] :parent-seed ~s :seed ~s) => ~s" 
                   name attributes parent-seed seed new-seed)
    new-seed))

(defun trace-xml-text-hook (string seed)
  (let ((new-seed (cons (car seed) (1+ (cdr seed)))))
    (trace-xml-log (car seed) 
                   "(text :string ~s :seed ~s) => ~s" 
                   string seed new-seed)
    new-seed))

(defun trace-xml (in)
  "Parse and trace a toplevel XML element from stream in"
  (start-parse-xml in
                   (make-instance 'xml-parser-state
                                  :seed (cons 0 0)
                                  ;; seed car is xml element nesting level
                                  ;; seed cdr is ever increasing from element to element
                                  :new-element-hook #'trace-xml-new-element-hook
                                  :finish-element-hook #'trace-xml-finish-element-hook
                                  :text-hook #'trace-xml-text-hook)))
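
The hooks above call trace-xml-log, which is not shown in this document. A minimal sketch of such a helper, assuming it only needs to indent by nesting level and print one formatted line, could be:

```lisp
;; Hypothetical helper: the real definition is not shown in this document.
;; Assumption: two spaces of indentation per nesting level.
(defun trace-xml-log (level control-string &rest args)
  "Print one trace line to *standard-output*, indented two spaces per LEVEL."
  (write-string (make-string (* 2 level) :initial-element #\Space))
  (apply #'format t control-string args)
  (terpri)
  (values))
```

The first argument is the nesting level taken from the CAR of the seed; the remaining arguments are passed to format unchanged.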

This is the output of the tracer on two small XML documents. The seed is a CONS that keeps track of the nesting level in its CAR and of its flow through the hooks, as an ever increasing number, in its CDR:

S-XML 31 > (with-input-from-string (in "<FOO X='10' Y='20'><P>Text</P><BAR/><H1><H2></H2></H1></FOO>") (trace-xml in))
(new-element :name :FOO :attributes ((:Y . "20") (:X . "10")) :seed (0 . 0)) => (1 . 1)
  (new-element :name :P :attributes () :seed (1 . 1)) => (2 . 2)
    (text :string "Text" :seed (2 . 2)) => (2 . 3)
  (finish-element :name :P :attributes () :parent-seed (1 . 1) :seed (2 . 3)) => (1 . 4)
  (new-element :name :BAR :attributes () :seed (1 . 4)) => (2 . 5)
  (finish-element :name :BAR :attributes () :parent-seed (1 . 4) :seed (2 . 5)) => (1 . 6)
  (new-element :name :H1 :attributes () :seed (1 . 6)) => (2 . 7)
    (new-element :name :H2 :attributes () :seed (2 . 7)) => (3 . 8)
    (finish-element :name :H2 :attributes () :parent-seed (2 . 7) :seed (3 . 8)) => (2 . 9)
  (finish-element :name :H1 :attributes () :parent-seed (1 . 6) :seed (2 . 9)) => (1 . 10)
(finish-element :name :FOO :attributes ((:Y . "20") (:X . "10")) :parent-seed (0 . 0) :seed (1 . 10)) => (0 . 11)
(0 . 11)

S-XML 32 > (with-input-from-string (in "<FOO><UL><LI>1</LI><LI>2</LI><LI>3</LI></UL></FOO>") (trace-xml in))
(new-element :name :FOO :attributes () :seed (0 . 0)) => (1 . 1)
  (new-element :name :UL :attributes () :seed (1 . 1)) => (2 . 2)
    (new-element :name :LI :attributes () :seed (2 . 2)) => (3 . 3)
      (text :string "1" :seed (3 . 3)) => (3 . 4)
    (finish-element :name :LI :attributes () :parent-seed (2 . 2) :seed (3 . 4)) => (2 . 5)
    (new-element :name :LI :attributes () :seed (2 . 5)) => (3 . 6)
      (text :string "2" :seed (3 . 6)) => (3 . 7)
    (finish-element :name :LI :attributes () :parent-seed (2 . 5) :seed (3 . 7)) => (2 . 8)
    (new-element :name :LI :attributes () :seed (2 . 8)) => (3 . 9)
      (text :string "3" :seed (3 . 9)) => (3 . 10)
    (finish-element :name :LI :attributes () :parent-seed (2 . 8) :seed (3 . 10)) => (2 . 11)
  (finish-element :name :UL :attributes () :parent-seed (1 . 1) :seed (2 . 11)) => (1 . 12)
(finish-element :name :FOO :attributes () :parent-seed (0 . 0) :seed (1 . 12)) => (0 . 13)
(0 . 13)

The following example counts tags, attributes and characters:

(defclass count-xml-seed ()
  ((elements :initform 0)
   (attributes :initform 0)
   (characters :initform 0)))

(defun count-xml-new-element-hook (name attributes seed)
  (declare (ignore name))
  (incf (slot-value seed 'elements))
  (incf (slot-value seed 'attributes) (length attributes))
  seed)

(defun count-xml-text-hook (string seed)
  (incf (slot-value seed 'characters) (length string))
  seed)
  
(defun count-xml (in)
  "Parse a toplevel XML element from stream in, counting elements, attributes and characters"
  (start-parse-xml in
                   (make-instance 'xml-parser-state
                                  :seed (make-instance 'count-xml-seed)
                                  :new-element-hook #'count-xml-new-element-hook
                                  :text-hook #'count-xml-text-hook)))

(defun count-xml-file (pathname)
  "Parse XML from the file at pathname, counting elements, attributes and characters"
  (with-open-file (in pathname)
    (let ((result (count-xml in)))
      (with-slots (elements attributes characters) result
        (format t
                "~a contains ~d XML elements, ~d attributes and ~d characters.~%"
                pathname elements attributes characters)))))

This example removes XML markup:

(defun remove-xml-markup (in)
  (let* ((state (make-instance 'xml-parser-state
                              :text-hook #'(lambda (string seed) (cons string seed))))
         (result (start-parse-xml in state)))
    (apply #'concatenate 'string (nreverse result))))

The next example is from the xml-element struct DOM implementation, where the SSAX parser hook functions are building the actual DOM:

(defun standard-new-element-hook (name attributes seed)
  (declare (ignore name attributes seed))
  '())

(defun standard-finish-element-hook (name attributes parent-seed seed)
  (let ((xml-element (make-xml-element :name name
                                       :attributes attributes
                                       :children (nreverse seed))))
    (cons xml-element parent-seed)))

(defun standard-text-hook (string seed)
  (cons string seed))

(defmethod parse-xml-dom (stream (output-type (eql :xml-struct)))
  (car (start-parse-xml stream
                        (make-instance 'xml-parser-state
                                       :new-element-hook #'standard-new-element-hook
                                       :finish-element-hook #'standard-finish-element-hook
                                       :text-hook #'standard-text-hook))))

The parser state can be used to specify the initial seed value (nil by default), and the set of known entities (by default the 5 standard entities (lt, gt, amp, quot, apos) plus nbsp).
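
As an illustration of passing an explicit initial seed, here is a small sketch that collects element names in document order. The function names are made up for this example, and it assumes that ASDF can find the S-XML system; it works inside the S-XML package, as the transcripts in this document do.

```lisp
;; Assumption: ASDF can find the S-XML system.
(require :asdf)
(asdf:load-system :s-xml)

(in-package :s-xml)

;; All names below (collect-name-*, element-names) are hypothetical.
(defun collect-name-new-element-hook (name attributes seed)
  "Push NAME onto the seed, a list of the element names seen so far."
  (declare (ignore attributes))
  (cons name seed))

(defun collect-name-finish-element-hook (name attributes parent-seed seed)
  "Keep the names collected inside the element instead of reverting to PARENT-SEED."
  (declare (ignore name attributes parent-seed))
  seed)

(defun collect-name-text-hook (string seed)
  "Ignore text content and pass the seed through unchanged."
  (declare (ignore string))
  seed)

(defun element-names (string)
  "Return the names of all elements in STRING, in document order."
  (with-input-from-string (in string)
    (reverse
     (start-parse-xml in
                      (make-instance 'xml-parser-state
                                     :seed '() ; explicit initial seed
                                     :new-element-hook #'collect-name-new-element-hook
                                     :finish-element-hook #'collect-name-finish-element-hook
                                     :text-hook #'collect-name-text-hook)))))
```

Calling (element-names "<FOO><P>Text</P><BAR/></FOO>") should return the names of FOO, P and BAR, in that order. The seed is consed in reverse, so the result is reversed once at the end.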

DOM

Using a DOM parser is easier, but usually less efficient. Currently three different DOMs are supported: :lxml, :sxml and :xml-struct.

There is a generic API that is identical for each type of DOM, with an extra parameter input-type or output-type used to specify the type of DOM. The default DOM type is :lxml. Here are some examples:

? (in-package :s-xml)
#<Package "S-XML">

? (setf xml-string "<foo id='top'><bar>text</bar></foo>")
"<foo id='top'><bar>text</bar></foo>"

? (parse-xml-string xml-string)
((:|foo| :|id| "top") (:|bar| "text"))

? (parse-xml-string xml-string :output-type :sxml)
(:|foo| (:@ (:|id| "top")) (:|bar| "text"))

? (parse-xml-string xml-string :output-type :xml-struct)
#S(XML-ELEMENT :NAME :|foo| :ATTRIBUTES ((:|id| . "top"))
               :CHILDREN (#S(XML-ELEMENT :NAME :|bar|
                                         :ATTRIBUTES NIL
                                         :CHILDREN ("text"))))

? (print-xml * :pretty t :input-type :xml-struct)
<foo id="top">
  <bar>text</bar>
</foo>
NIL

? (print-xml '(p "Interesting stuff at " ((a href "http://slashdot.org") "SlashDot")))
<P>Interesting stuff at <A HREF="http://slashdot.org">SlashDot</A></P>
NIL
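
The generic API makes a parse and print round trip straightforward. The following is a sketch under two assumptions: that ASDF can find the S-XML system, and that print-xml-string takes the same :input-type keyword as print-xml. The name round-trip-xml is made up for this example.

```lisp
;; Assumption: ASDF can find the S-XML system.
(require :asdf)
(asdf:load-system :s-xml)

(in-package :s-xml)

;; Hypothetical helper, not part of S-XML.
;; Assumption: print-xml-string accepts :input-type like print-xml does.
(defun round-trip-xml (string &optional (dom-type :lxml))
  "Parse STRING into a DOM of DOM-TYPE and print that DOM back to a string."
  (print-xml-string (parse-xml-string string :output-type dom-type)
                    :input-type dom-type))
```

Parsing the result of round-trip-xml again should give a DOM that is EQUAL to the one parsed from the original string, at least for the list-based :lxml and :sxml types.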

Tag and attribute names are converted to keywords. Because XML is case-sensitive, these keywords preserve the original case of the names, which is why Common Lisp has to resort to the special |...| literal symbol syntax to print them.
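
The effect of case on keywords can be checked in plain Common Lisp, without S-XML. In this sketch xml-name-keyword is a hypothetical helper, not S-XML's own function, and the default readtable case (:upcase) is assumed.

```lisp
;; Plain Common Lisp, no S-XML required.
;; Assumption: the default readtable case (:upcase) is in effect.
;; xml-name-keyword is a hypothetical helper, not part of S-XML.
(defun xml-name-keyword (string)
  "Intern STRING as a keyword, preserving its case exactly."
  (intern string :keyword))

;; The reader upcases unescaped symbol names, so :foo is the keyword named "FOO",
;; while :|foo| is the keyword named "foo".
(defparameter *lowercase-matches-escaped*
  (eq (xml-name-keyword "foo") :|foo|))   ; true

(defparameter *lowercase-matches-plain*
  (eq (xml-name-keyword "foo") :foo))     ; false

(defparameter *uppercase-matches-plain*
  (eq (xml-name-keyword "FOO") :foo))     ; true
```

This is also why a DOM written with plain symbols such as p or a prints as upper-case tags.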

Release History and ChangeLog

2006-01-19 Sven Van Caekenberghe 

	* added a set of patches contributed by David Tolpin dvd@davidashen.net : we're now using char of type 
	Character and #\Null instead of null, read/unread instead of peek/read and some more declarations for
	more efficiency - added hooks for customizing parsing attribute names and values

2005-11-20 Sven Van Caekenberghe 

	* added xml prefix namespace as per REC-xml-names-19990114 (by Rudi Schlatte)

2005-11-06 Sven Van Caekenberghe 

	* removed Debian packaging directory (on Luca's request)
	* added CDATA support (patch contributed by Peter Van Eynde pvaneynd@mailworks.org)

2005-08-30 Sven Van Caekenberghe 

	* added Debian packaging directory (contributed by Luca Capello luca@pca.it)
	* added experimental XML namespace support 

2005-02-03 Sven Van Caekenberghe <svc@mac.com>

        * release 5 (cvs tag RELEASE_5)
	* added :start and :end keywords to print-string-xml
	* fixed a bug: in a tag containing whitespace, like <foo> </foo>, the parser collapsed
	  and ignored all whitespace and considered the tag to be empty!
	  this is now fixed and a unit test has been added
	* cleaned up xml character escaping a bit: single quotes and all normal whitespace
	  (newline, return and tab) are preserved; a unit test for this has been added
	* IE doesn't understand the &apos; XML entity, so I've commented that out for now.
	  Also, using actual newlines for newlines is probably better than using #xA,
	  which won't get any end of line conversion by the server or user agent.

June 2004 Sven Van Caekenberghe <svc@mac.com>

	* release 4
	* project moved to common-lisp.net, renamed to s-xml
	* added examples counter, tracer and remove-markup, improved documentation

13 Jan 2004 Sven Van Caekenberghe <svc@mac.com>
	
	* release 3
	* added ASDF systems
	* optimized print-string-xml

10 Jun 2003 Sven Van Caekenberghe <svc@mac.com>
	
	* release 2
	* added echo-xml function: we are no longer taking the car when
	  the last seed is returned from start-parse-xml

25 May 2003 Sven Van Caekenberghe <svc@mac.com>
	
	* release 1
	* first public release of working code
	* tested on OpenMCL
	* rewritten to be event-based, to improve efficiency and 
	  to optionally use different DOM representations
	* more documentation

end of 2002 Sven Van Caekenberghe <svc@mac.com>
	
	* release 0
	* as part of an XML-RPC implementation

Todo

Mailing Lists

CVS version $Id: index.html,v 1.12 2006/01/31 11:56:06 scaekenberghe Exp $
