Skip to content
unicode.tex 31.7 KiB
Newer Older
\chapter{Internationalization}
\label{i18n}
\cindex{Internationalization}

\cmucl{} supports internationalization by supporting Unicode
characters internally and by adding support for external formats to
convert from the internal format to an appropriate external character
coding format.

To understand the support for Unicode, we refer the reader to the
\ifpdf
\href{http://www.unicode.org/}{Unicode standard}.
\else
\emph{Unicode standard} at \href{http://www.unicode.org}
\fi
\section{Changes}

To support internationalization, the following changes to Common Lisp
functions have been done.


\subsection{Design Choices}

To support Unicode, there are many approaches.  One choice is to
support both 8-bit \code{base-char} and a 21-bit (or larger)
\code{character} since Unicode codepoints use 21 bits.  This generally
means strings are much larger, and complicates the compiler by having
to support both \code{base-char} and \code{character} types and the
corresponding string types.  This also adds complexity for the user to
understand the difference between the different string and character
types.

Another choice is to have just one character and string type that can
hold the entire Unicode codepoint.  While simplifying the compiler and
reducing the burden on the user, this significantly increases memory
usage for strings.

The solution chosen by \cmucl{} is to tradeoff the size and complexity
by having only 16-bit characters.  Most of the important languages can
be encoded using only 16-bits.  The rest of the codepoints are for
rare languages or ancient scripts.  Thus, the memory usage is
significantly reduced while still supporting the the most important
languages.  Compiler complexity is also reduced since \code{base-char}
and \code{character} are the same as are the string types..  But we
still want to support the full Unicode character set.  This is
achieved by making strings be UTF-16 strings internally.  Hence, Lisp
strings are UTF-16 strings, and Lisp characters are UTF-16 code-units.


\subsection{Characters}
\label{sec:i18n:characters}

Characters are now 16 bits long instead of 8 bits, and \code{base-char}
and \code{character} types are the same.  This difference is
naturally indicated by changing \code{char-code-limit} from 256 to
65536.

\subsection{Strings}
\label{sec:i18n:strings}

In \cmucl{} there is only one type of string---\code{base-string} and
\code{string} are the same.  

Internally, the strings are encoded using UTF-16.  This means that in
some rare cases the number of Lisp characters in a string is not the
same as the number of codepoints in the string.


\section{External Formats}

To be able to communicate to the external world, \cmucl{} supports
external formats to convert to and from the external world to
\cmucl{}'s string format.  The external format is specified in several
ways.  The standard streams \var{*standard-input*},
\var{*standard-output*}, and \var{*standard-error*} take the format
from the value specified by \var{*default-external-format*}.  The
default value of \var{*default-external-format*} is \kwd{iso8859-1}.

For files, \code{OPEN} takes the \kwd{external-format}
parameter to specify the format.  The default external format is
\kwd{default}. 

\subsection{Available External Formats}

The available external formats are listed below in
Table~\ref{table:external-formats}.  The first column gives the
external format, and the second column gives a list of aliases that
can be used for this format.  The set of aliases can be changed by
changing the \file{aliases} file.

For all of these formats, if an illegal sequence is encountered, no
error or warning is signaled.  Instead, the offending sequence is
silently replaced with the Unicode REPLACEMENT CHARACTER (U$+$FFFD).

\begin{table}
  \centering
  \begin{tabular}{|l|l|p{3in}|}
    \hline
    \textbf{Format} & \textbf{Aliases} & \textbf{Description} \\
    \hline
    \hline
    \kwd{iso8859-1} & \kwd{latin1} \kwd{latin-1} \kwd{iso-8859-1} & ISO8859-1 \\
    \hline
    \kwd{iso8859-2} & \kwd{latin2} \kwd{latin-2} \kwd{iso-8859-2} & ISO8859-2 \\
    \hline
    \kwd{iso8859-3} & \kwd{latin3} \kwd{latin-3} \kwd{iso-8859-3} & ISO8859-3 \\
    \hline
    \kwd{iso8859-4} & \kwd{latin4} \kwd{latin-4} \kwd{iso-8859-4} & ISO8859-4 \\
    \hline
    \kwd{iso8859-5} & \kwd{cyrillic} \kwd{iso-8859-5} & ISO8859-5 \\
    \hline
    \kwd{iso8859-6} & \kwd{arabic} \kwd{iso-8859-6} & ISO8859-6 \\
    \hline
    \kwd{iso8859-7} & \kwd{greek} \kwd{iso-8859-7} & ISO8859-7 \\
    \hline
    \kwd{iso8859-8} & \kwd{hebrew} \kwd{iso-8859-8} & ISO8859-8 \\
    \hline
    \kwd{iso8859-9} & \kwd{latin5} \kwd{latin-5} \kwd{iso-8859-9} & ISO8859-9 \\
    \hline
    \kwd{iso8859-10} & \kwd{latin6} \kwd{latin-6} \kwd{iso-8859-10} & ISO8859-10 \\
    \hline
    \kwd{iso8859-13} & \kwd{latin7} \kwd{latin-7} \kwd{iso-8859-13} & ISO8859-13 \\
    \hline
    \kwd{iso8859-14} & \kwd{latin8} \kwd{latin-8} \kwd{iso-8859-14} & ISO8859-14 \\
    \hline
    \kwd{iso8859-15} & \kwd{latin9} \kwd{latin-9} \kwd{iso-8859-15} & ISO8859-15 \\
    \hline
    \kwd{utf-8} & \kwd{utf} \kwd{utf8} & UTF-8 \\
    \hline
    \kwd{utf-16} & \kwd{utf16} & UTF-16 with optional BOM \\
    \hline
    \kwd{utf-16-be} & \kwd{utf-16be} \kwd{utf16-be} & UTF-16 big-endian (without BOM) \\
    \hline
    \kwd{utf-16-le} & \kwd{utf-16le} \kwd{utf16-le} & UTF-16 little-endian (without BOM) \\
    \hline
    \kwd{utf-32} & \kwd{utf32} & UTF-32 with optional BOM \\
    \hline
    \kwd{utf-32-be} & \kwd{utf-32be} \kwd{utf32-be} & UTF-32 big-endian (without BOM) \\
    \hline
    \kwd{utf-32-le} & \kwd{utf-32le} \kwd{utf32-le} & UTF-32 little-endian (without BOM) \\
    \hline
    \kwd{cp1250} & & \\
    \hline
    \kwd{cp1251} & & \\
    \hline
    \kwd{cp1252} & \kwd{windows-1252} \kwd{windows-cp1252} \kwd{windows-latin1} & \\
    \hline
    \kwd{cp1253} & & \\
    \hline
    \kwd{cp1254} & & \\
    \hline
    \kwd{cp1255} & & \\
    \hline
    \kwd{cp1256} & & \\
    \hline
    \kwd{cp1257} & & \\
    \hline
    \kwd{cp1258} & & \\
    \hline
    \kwd{koi8-r} & & \\
    \hline
    \kwd{mac-cyrillic} & & \\
    \hline
    \kwd{mac-greek} & & \\
    \hline
    \kwd{mac-icelandic} & & \\
    \hline
    \kwd{mac-latin2} & & \\
    \hline
    \kwd{mac-roman} & & \\
    \hline
    \kwd{mac-turkish} & & \\
    \hline
  \end{tabular}
  \caption{External formats}
  \label{table:external-formats}
\end{table}

\subsection{Composing External Formats}

A composing external format is an external format that converts between
one codepoint and another, rather than between codepoints and octets.
A composing external format must be used in conjunction with another
(octet-producing) external format.  This is specified by
using a list as the external format.  For example, we can use
\code{'(\kwd{latin1} \kwd{crlf})} as the external format. In this
particular example, the external format is latin1, but whenever a
carriage-return/linefeed sequence is read, it is converted to the Lisp
\lispchar{Newline} character.  Conversely, whenever a string is written,
a Lisp \lispchar{Newline} character is converted to a
carriage-return/linefeed sequence.  Without the \kwd{crlf} composing
format, the carriage-return and linefeed will be read in as separate
characters, and on output the Lisp \lispchar{Newline} character is
output as a single linefeed character.

Table~\ref{table:composing-formats} lists the available composing formats.

\begin{table}
  \centering
  \begin{tabular}{|l|l|p{3in}|}
    \hline
    \textbf{Format} & \textbf{Aliases} & \textbf{Description} \\
    \hline
    \hline
    \kwd{crlf} & \kwd{dos} & Composing format for converting to/from DOS (CR/LF)
    end-of-line sequence to \lispchar{Newline}\\
    \kwd{cr} & \kwd{mac} & Composing format for converting to/from DOS (CR/LF)
    end-of-line sequence to \lispchar{Newline}\\
    \hline
    \kwd{beta-gk} & & Composing format that translates (lower-case) Beta
    code (an ASCII encoding of ancient Greek) \\
    \hline
    \kwd{final-sigma} & & Composing format that attempts to detect sigma in
    word-final position and change it from U+3C3 to U+3C2\\
    \hline
  \end{tabular}
  \caption{Composing external formats}
  \label{table:composing-formats}
\end{table}

\section{Dictionary}

\subsection{Variables}

\begin{defvar}{extensions:}{default-external-format}
   This is the default external format to use for all newly opened
   files.  It is also the default format to use for
   \var{*standard-input*}, \var{*standard-output*}, and
   \var{*standard-error*}.  The default value is \kwd{iso8859-1}.

   Setting this will cause the standard streams to start using the new
   format immediately.  If a stream has been created with external
   format \kwd{default}, then setting \var{*default-external-format*}
   will cause all subsequent input and output to use the new value of
   \var{*default-external-format*}.
\end{defvar}
\subsection{Characters}

Remember that \cmucl{}'s characters are only 16-bits long but Unicode
codepoints are up to 21 bits long.  Hence there are codepoints that
cannot be represented via Lisp characters.  Operating on individual
characters is not recommended.  Operations on strings are better.
(This would be true even if \cmucl{}'s characters could hold a
full Unicode codepoint.)

\begin{defun}{}{char-equal}{\amprest{} \var{characters}}
   \defunx{char-not-equal}{\amprest{} \var{characters}}
   \defunx{char-lessp}{\amprest{} \var{characters}}
   \defunx{char-greaterp}{\amprest{} \var{characters}}
   \defunx{char-not-greaterp}{\amprest{} \var{characters}}
   \defunx{char-not-lessp}{\amprest{} \var{characters}}
   For the comparison, the characters are converted to lowercase and
   the corresponding \code{char-code} are compared.
\end{defun}

\begin{defun}{}{alpha-char-p}{\args \var{character}}
  Returns non-nil{} if the Unicode category is a letter category.
\end{defun}

\begin{defun}{}{alphanumericp}{\args \var{character}}
  Returns non-nil{} if the Unicode category is a letter category or an ASCII
  digit.
\end{defun}

\begin{defun}{}{digit-char-p}{\args \var{character} \ampoptional{} \var{radix}}
   Only recognizes ASCII digits (and ASCII letters if the radix is larger
   than 10).
\end{defun}

\begin{defun}{}{graphic-char-p}{\args \var{character}}
  Returns non-nil{} if the Unicode category is a graphic category.
\end{defun}

\begin{defun}{}{upper-case-p}{\args \var{character}}
  \defunx{lower-case-p}{\args \var{character}}
  Returns non-nil{} if the Unicode category is an uppercase
  (lowercase) character.
\end{defun}

\begin{defun}{lisp:}{title-case-p}{\args \var{character}}
  Returns non-nil{} if the Unicode category is a titlecase character.
\end{defun}

\begin{defun}{}{both-case-p}{\args \var{character}}
  Returns non-nil{} if the Unicode category is an uppercase,
  lowercase, or titlecase character.
\end{defun}

\begin{defun}{}{char-upcase}{\args \var{character}}
  \defunx{char-downcase}{\args \var{character}}
  The Unicode uppercase (lowercase) letter is returned.
\end{defun}

\begin{defun}{lisp:}{char-titlecase}{\args \var{character}}
  The Unicode titlecase letter is returned.
\end{defun}

\begin{defun}{}{char-name}{\args \var{char}}
   If possible the name of the character \var{char} is returned.  If
   there is a Unicode name, the Unicode name is returned, except
   spaces are converted to underscores and the string is capitalized
   via \code{string-capitalize}.  If there is no Unicode name, the
   form \lispchar{U+xxxx} is returned where ``xxxx'' is the
   \code{char-code} of the character, in hexadecimal.
\end{defun}

\begin{defun}{}{name-char}{\args \var{name}}
  The inverse to \code{char-name}.  If no character has the name
  \var{name}, then \nil{} is returned.  Unicode names are not
  case-sensitive, and spaces and underscores are optional.
\end{defun}
\subsection{Strings}

Strings in \cmucl{} are UTF-16 strings.  That is, for Unicode code
points greater than 65535, surrogate pairs are used.  We refer the
reader to the Unicode standard for more information about surrogate
pairs.  We just want to make a note that because of the UTF-16
encoding of strings, there is a distinction between Lisp characters
and Unicode codepoints.  The standard string operations know about
this encoding and handle the surrogate pairs correctly.


\begin{defun}{}{string-upcase}{\args \var{string} \keys{\kwd{start}
      \kwd{end} \kwd{casing}}}
  \defunx{string-downcase}{\args \var{string} \keys{\kwd{start}
      \kwd{end} \kwd{casing}}}
  \defunx{string-capitalize}{\args \var{string} \keys{\kwd{start}
      \kwd{end} \kwd{casing}}}
  The case of the \var{string} is changed appropriately.  Surrogate
  pairs are handled correctly.  The conversion to the appropriate case
  is done based on the Unicode conversion.  The additional argument
  \kwd{casing} controls how case conversion is done.  The default
  value is \kwd{simple}, which uses simple Unicode case conversion.
  If \kwd{casing} is \kwd{full}, then full Unicode case conversion is
  done where the string may actually increase in length.
\end{defun}

\begin{defun}{}{nstring-upcase}{\args \var{string} \keys{\kwd{start} \kwd{end}}}
  \defunx{nstring-downcase}{\args \var{string} \keys{\kwd{start} \kwd{end}}}
  \defunx{nstring-capitalize}{\args \var{string} \keys{\kwd{start}
      \kwd{end}}}
  The case of the \var{string} is changed appropriately.  Surrogate
  pairs are handled correctly.  The conversion to the appropriate case
  is done based on the Unicode conversion.  (Full casing is not
  available because the string length cannot be increased when needed.)
\end{defun}

\begin{defun}{}{string=}{\args \var{s1} \var{s2} \keys{\kwd{start1}
      \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string/=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string$<$}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string$>$}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string$<$=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string$>$=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  The string comparison is done in codepoint order.  (This is
  different from just comparing the order of the individual characters
  due to surrogate pairs.)  Unicode collation is not done.
\end{defun}

\begin{defun}{}{string-equal}{\args \var{s1} \var{s2} \keys{\kwd{start1}
      \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string-not-equal}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string-lessp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string-greaterp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string-not-greaterp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  \defunx{string-not-lessp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
  Each codepoint in each string is converted to lowercase and the
  appropriate comparison of the codepoint values is done.  Unicode
  collation is not done.
\end{defun}

\begin{defun}{}{string-left-trim}{\args \var{bag} \var{string}}
  \defunx{string-right-trim}{\args \var{bag} \var{string}}
  \defunx{string-trim}{\args \var{bag} \var{string}}
  Removes any characters in \code{bag} from the left, right, or both
  ends of the string \code{string}, respectively.  This has potential
  problems if you want to remove a surrogate character from the
  string, since a single character cannot represent a surrogate.  As
  an extension, if \code{bag} is a string, we properly handle
  surrogate characters in the \code{bag}.
\end{defun}

\subsection{Sequences}

Since strings are also sequences, the sequence functions can be used
on strings.  We note here some issues with these functions.  Most
issues are due to the fact that strings are UTF-16 strings and
characters are UTF-16 code units, not Unicode codepoints.

\begin{defun}{}{remove-duplicates}{\args \var{sequence}
    \keys{\kwd{from-end} \kwd{test} \kwd{test-not} \kwd{start}
      \kwd{end} \kwd{key}}}
  \defunx{delete-duplicates}{\args \var{sequence}
    \keys{\kwd{from-end} \kwd{test} \kwd{test-not} \kwd{start}
      \kwd{end} \kwd{key}}}
  Because of surrogate pairs these functions may remove a high or low
  surrogate value, leaving the string in an invalid state.  Use these
  functions carefully with strings.
\end{defun}


\subsection{Reader}

To support Unicode characters, the reader has been extended to
recognize characters written in hexadecimal.  Thus \lispchar{U+41} is
the ASCII capital letter ``A'', since 41 is the hexadecimal code for
that letter.  The Unicode name of the character is also recognized,
except spaces in the name are replaced by underscores.

Recall, however, that characters in \cmucl{} are only 16 bits long so
many Unicode characters cannot be represented.  However, strings can
represent all Unicode characters.

When symbols are read, the symbol name is converted to Unicode NFC
form before interning the symbol into the package.  Hence,
\code{symbol-name (intern ``string'')} may produce a string that is
not \code{string=} to ``string''.  However, after conversion to NFC
form, the strings will be identical.

\subsection{Printer}

When printing characters, if the character is a graphic character, the
character is printed.  Thus \lispchar{U+41} is printed as
\lispchar{A}.  If the character is not a graphic character, the Lisp
name (e.g., \lispchar{Tab}) is used if possible;
if there is no Lisp name, the Unicode name is used.  If there is no
Unicode name, the hexadecimal char-code is
printed.  For example, \lispchar{U+34e}, which is not a graphic
character, is printed as \lispchar{Combining\_Upwards\_Arrow\_Below},
and \lispchar{U+9f} which is not a graphic character and does not have a
Unicode name, is printed as \lispchar{U+009F}.

\subsection{Miscellaneous}


\subsubsection{Files}

\cmucl{} loads external formats using the search-list
\file{ext-formats:}.  The \file{aliases} file is also located using
this search-list.

The Unicode data base is stored in compressed form in the file
\file{ext-formats:unidata.bin}.  If this file is not found, Unicode
support is severely reduced; you can only use ASCII characters.

\begin{defun}{}{open}{\args \var{filename} \amprest \var{options}
    \keys{\kwd{direction} \kwd{element-type} \kwd{if-exists}
      \kwd{if-does-not-exist} \morekeys \kwd{class} \kwd{mapped}
      \kwd{input-handle} \kwd{output-handle}
      \yetmorekeys \kwd{external-format} \kwd{decoding-error}
rtoy's avatar
rtoy committed
      \kwd{encoding-error}}}

    The main options are covered elsewhere.  Here we describe the
    options specific to Unicode.  The option \kwd{external-format}
    specifies the external format to use for reading and writing the
    file.  The external format is a keyword.

    The options \kwd{decoding-error} and \kwd{encoding-error} are used
    to specify how encoding and decoding errors are handled.  The
    default value on \nil means the external format handles errors
    itself and typically replaces invalid sequences with the Unicode
    replacement character.

    Otherwise, the value for \code{decoding-error} is either a
    character, a symbol or a function.  If a character is
    specified. it is used as the replacement character for any invalid
    decoding.  If a symbol or a function is given, it must be a
    function of three arguments: a message string to be printed, the
    offending octet, and the number of octets read.  If the function
    returns, it should return two values: the code point to use as the
    replacement character and the number of octets read.  In addition,
    \true{} may be specified.  This indicates that a continuable error
    is signaled, which, if continued, the Unicode replacement
    character is used.

    For \code{encoding-error}, a character, symbol, or function can be
    specified, like \code{decoding-error}, with the same meaning.  The
    function, however, takes two arguments:  a format message string
    and the incorrect codepoint.  If the function returns, it should
    be the replacement codepoint.
\end{defun}    
      
\subsubsection{Utilities}

\begin{defun}{stream:}{set-system-external-format}{\var{terminal} \ampoptional{} \var{filenames}}
  This function changes the external format used for
  \var{*standard-input*}, \var{*standard-output*}, and
  \var{*standard-error*} to the external format specified by
  \var{terminal}.  Additionally, the Unix file name encoding can be
  set to the value specified by \var{filenames} if non-\nil.
\end{defun}

\begin{defun}{extensions:}{list-all-external-formats}{}
  list all of the vailable external formats.  A list is returned where
  each element is a list of the external format name and a list of
  aliases for the format.  No distinction is made between external
  formats and composing external formats.
\end{defun}

\begin{defun}{extensions:}{describe-external-format}{external-format}
  Print a description of the given \var{external-format}.  This may
  cause the external format to be loaded (silently) if it is not
  already loaded.
\end{defun}

Since strings are UTF-16 and hence may contain surrogate pairs, some
utility functions are provided to make access easier.

\begin{defun}{lisp:}{codepoint}{\args \var{string} \var{i}
    \ampoptional{} \var{end}}
  Return the codepoint value from \var{string} at position \var{i}.
  If code unit at that position is a surrogate value, it is combined
  with either the previous or following code unit (when possible) to
  compute the codepoint.  The first return value is the codepoint
  itself.  The second return value is \nil{} if the position is not a
  surrogate pair.  Otherwise, $+1$ or $-1$ is returned if the position
  is the high (leading) or low (trailing) surrogate value, respectively.

  This is useful for iterating through a string in codepoint sequence.
\end{defun}

\begin{defun}{lisp:}{surrogates-to-codepoint}{\args \var{hi} \var{lo}}
  Convert the given \var{hi} and \var{lo} surrogate characters to the
  corresponding codepoint value
\end{defun}

\begin{defun}{lisp:}{surrogates}{\args \var{codepoint}}
  Convert the given \var{codepoint} value to the corresponding high
  and low surrogate characters.  If the codepoint is less than 65536,
  the second value is \nil{} since the codepoint does not need to be
  represented as a surrogate pair.
\end{defun}

\begin{defun}{stream:}{string-encode}{\args \var{string}
    \var{external-format} \ampoptional{} (\var{start} 0) \var{end}}
  \code{string-encode} encodes \var{string} using the format
rtoy's avatar
rtoy committed
  \var{external-format}, producing an array of octets.  Each octet is
  converted to a character via \code{code-char} and the resulting
  string is returned.

  The optional argument \var{start}, defaulting to 0, specifies the
  starting index and \var{end}, defaulting to the length of the
  string, is the end of the string.
\end{defun}

\begin{defun}{stream:}{string-decode}{\args \var{string}
    \var{external-format} \ampoptional{} (\var{start} 0) \var{end}}
  \code{string-decode} decodes \var{string} using the format
  \var{external-format} and produces a new string.  Each character of
  \var{string} is converted to octet (by \code{char-code}) and the
  resulting array of octets is used by the external format to produce
  a string.  This is the inverse of \code{string-encode}.

  The optional argument \var{start}, defaulting to 0, specifies the
  starting index and \var{end}, defaulting to the length of the
  string, is the end of the string.

  \var{string} must consist of characters whose \code{char-code} is
  less than 256.
\end{defun}

\begin{defun}{}{string-to-octets}{\args \var{string} \keys{\kwd{start}
      \kwd{end} \kwd{external-format} \kwd{buffer} \kwd{buffer-start}
      \kwd{error}}}
  \code{string-to-octets} converts \var{string} to a sequence of
  octets according to the external format specified by
  \var{external-format}.  The string to be converted is bounded by
  \var{start}, which defaults to 0, and \var{end}, which defaults to
  the length of the string.  If \var{buffer} is specified, the octets
  are placed in \var{buffer}.  If \var{buffer} is not specified, a new
  array is allocated to hold the octets.  \var{buffer-start} specifies
  where in the buffer the first octet will be placed.

  An error method may also be specified by \var{error}.  Any errors
  encountered while converting the string to octets will be handled
  according to error.  If \nil{}, a replacement character is converted
  to octets in place of the error.  Otherwise, \var{error} should be a
  symbol or function that will be called when the error occurs.  The
  function takes two arguments:  an error string and the character
  that caused the error.  It should return a replacement character.
  
  Three values are returned: The buffer, the number of valid octets
  written, and the number of characters converted.  Note that the
  actual number of octets written may be greater than the returned
  value, These represent the partial octets of the next character to
  be converted, but there was not enough room to hold the complete set
  of octets.
\end{defun}

\begin{defun}{}{octets-to-string}{\args \var{octets} \keys{\kwd{start}
      \kwd{end} \kwd{external-format} \kwd{string} \kwd{s-start}
      \kwd{s-end} \kwd{state}}}
  \code{octets-to-string} converts the sequence of octets in
  \var{octets} to a string.  \var{octets} must be a
  \code{(simple-array (unsigned-byte 8) (*))}.  The octets to be
  converted are bounded by \var{start} and \var{end}, which default to
  0 and the length of the array, respectively.  The conversion is
  performed according to the external format specified by
rtoy's avatar
rtoy committed
  \var{external-format}.  If \var{string} is specified, the octets are
  converted and stored in \var{string}, starting at \var{s-start}
  (defaulting to 0) and ending just before \var{s-end} (defaulting to
  the end of \var{string}.  \var{string} must be \code{simple-string}.
  If the bounded string is not large enough to hold all of the
  characters, then some octets will not be converted.  If \var{string}
  is not specified, a new string is created.

  The \var{state} is used as the initial state of for the external
  format.  This is useful when converting buffers of octets where the
  buffers are not on character boundaries, and state information is
  needed between buffers.

  Four values are returned: the string, the number of characters
rtoy's avatar
rtoy committed
  written to the string, and the number of octets consumed to produce
  the characters, and the final state of external format after
  converting the octets.
\section{Writing External Formats}

\subsection{External Formats}
Users may write their own external formats.  It is probably easiest to
look at existing external formats to see how do this.

An external format basically needs two functions:
\code{octets-to-code} to convert octets to Unicode codepoints and
\code{code-to-octets} to convert Unicode codepoints to octets.  The
external format is defined using the macro
\code{stream::define-external-format}.

\begin{defmac}[base]{stream:}{define-external-format}{\args \var{name}
    (\keys{\var{base} \var{min} \var{max} \var{size}
      \var{documentation}})
    (\amprest{} \var{slots})
    \morekeys{\var{octets-to-code} \var{code-to-octets}
              \var{flush-state} \var{copy-state}}}


  If \kwd{base} is not given, this defines a new external format of
  the name \kwd{name}. \var{min}, \var{max}, and \var{size} are the
  minimum and maximum number of octets that make up a character.
  (\code{\kwd{size} n} is just a short cut for \code{\kwd{min} n
    \kwd{max} n}.)  The description of the external format can be
  given using \kwd{documentation}.  The arguments \var{octets-to-code}
  and \var{code-to-octets} are not optional in this case.  They
  specify how to convert octets to codepoints and vice versa,
  respectively.  These should be backquoted forms for the body of a
  function to do the conversion.  See the description below for these
  functions.  Some good examples are the external format for
  \kwd{utf-8} or \kwd{utf-16}.  The \kwd{slots} argument is a list of
  read-only slots, similar to defstruct.  The slot names are available
  as local variables inside the \var{code-to-octets} and
  \var{octets-to-code} bodies. 
  If \kwd{base} is given, then an external format is defined with the
  name \kwd{name} that is based on a previously defined format
  \kwd{base}. The \var{slots} are inherited from the \kwd{base} format
  by default, although the definition may alter their values and add
  new slots. See, for example, the \kwd{mac-greek} external format. 
\begin{defmac}{}{octets-to-code}{\args \var{state} \var{input}
    \var{unput} \var{error} \amprest{} \var{args}}
  This defines a form to be used by an external format to convert
  octets to a code point.  \var{state} is a form that can be used by
  the body to access the state variable of a stream.  This can be used
  for any reason to hold anything needed by \code{octets-to-code}.
  \var{input} is a form that returns one octet from the input stream.
  \var{unput} will put back \var{N} octets to the stream.  \var{args} is a
  list of variables that need to be defined for any symbols in the
  body of the macro.

  \var{error} controls how errors are handled.  If \nil, some suitable
  replacement character is used.  That is, any errors are silently
  ignored and replaced by some replacement character.  If non-\nil,
  \var{error} is a symbol or function that is called to handle the
  error.  This function takes three arguments: a message string, the
  invalid octet (or \nil), and a count of the number of octets that
  have been read so far.  If the function returns, it should be the
  codepoint of the desired replacement character.
\end{defmac}

\begin{defmac}{}{code-to-octets}{\args \var{code} \var{state}
    \var{output} \var{error} \amprest{} \var{args}}
  Defines a form to be used by the external format to convert a code
  point to octets for output.  \var{code} is the code point to be
  converted.  \var{state} is a form to access the current value of the
  stream's state variable.  \var{output} is a form that writes one
  octet to the output stream.

  Similar to \code{octets-to-code}, \var{error} indicates how errors
  should be handled.  If \nil, some default replacement character is
  substituted.  If non-\nil, \var{error} should be a symbol or
  function.   This function takes two arguments:  a message string and
  the invalid codepoint.  If the function returns, it should be the
  codepoint that will be substituted for the invalid codepoint.
\end{defmac}

\begin{defmac}{}{flush-state}{\args \var{state}
    \var{output} \var{error} \amprest{} \var{args}}
  Defines a form to be used by the external format to flush out
  any state when an output stream is closed.  Similar to
  \code{code-to-octets}, but there is no code point to be output.  The
  \var{error} argument indicates how to handle errors.  If \nil, some
  default replacement character is used.  Otherwise, \var{error} is a
  symbol or function that will be called with a message string and
  codepoint of the offending state.  If the function returns, it
  should be the codepoint of a suitable replacement.

  If \code{flush-state} is \false, then nothing special is needed to
  flush the state to the output.

  This is called only when an output character stream is being closed.
\end{defmac}

\begin{defmac}{}{copy-state}{\args \var{state} \amprest{} \var{args}}
  Defines a form to copy any state needed by the external format.
  This should probably be a deep copy so that if the original
  state is modified, the copy is not.

  If not given, then nothing special is needed to copy the state
  either because there is no state for the external format or that no
  special copier is needed.
\end{defmac}

\subsection{Composing External Formats}

\begin{defmac}{stream:}{define-composing-external-format}{\args \var{name}
    (\keys{\var{min} \var{max} \var{size} \var{documentation}}) \var{input}
    \var{output}}
  This is the same as \code{define-external-format}, except that a
  composing external format is created.
\end{defmac}