Newer
Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
\chapter{Internationalization}
\label{i18n}
\cindex{Internationalization}
\cmucl{} supports internationalization by supporting Unicode
characters internally and by adding support for external formats to
convert from the internal format to an appropriate external character
coding format.
To understand the support for Unicode, we refer the reader to the
\ifpdf
\href{http://www.unicode.org/}{Unicode standard}.
\else
\emph{Unicode standard} at \href{http://www.unicode.org}
\fi
\section{Changes}
To support internationalization, the following changes to Common Lisp
functions have been done.
\subsection{Design Choices}
To support Unicode, there are many approaches. One choice is to
support both 8-bit \code{base-char} and a 21-bit (or larger)
\code{character} since Unicode codepoints use 21 bits. This generally
means strings are much larger, and complicates the compiler by having
to support both \code{base-char} and \code{character} types and the
corresponding string types. This also adds complexity for the user to
understand the difference between the different string and character
types.
Another choice is to have just one character and string type that can
hold the entire Unicode codepoint. While simplifying the compiler and
reducing the burden on the user, this significantly increases memory
usage for strings.
The solution chosen by \cmucl{} is to tradeoff the size and complexity
by having only 16-bit characters. Most of the important languages can
be encoded using only 16-bits. The rest of the codepoints are for
rare languages or ancient scripts. Thus, the memory usage is
significantly reduced while still supporting the the most important
languages. Compiler complexity is also reduced since \code{base-char}
and \code{character} are the same as are the string types.. But we
still want to support the full Unicode character set. This is
achieved by making strings be UTF-16 strings internally. Hence, Lisp
strings are UTF-16 strings, and Lisp characters are UTF-16 code-units.
\subsection{Characters}
\label{sec:i18n:characters}
Characters are now 16 bits long instead of 8 bits, and \code{base-char}
and \code{character} types are the same. This difference is
naturally indicated by changing \code{char-code-limit} from 256 to
65536.
\subsection{Strings}
\label{sec:i18n:strings}
In \cmucl{} there is only one type of string---\code{base-string} and
\code{string} are the same.
Internally, the strings are encoded using UTF-16. This means that in
some rare cases the number of Lisp characters in a string is not the
same as the number of codepoints in the string.
\section{External Formats}
To be able to communicate to the external world, \cmucl{} supports
external formats to convert to and from the external world to
\cmucl{}'s string format. The external format is specified in several
ways. The standard streams \var{*standard-input*},
\var{*standard-output*}, and \var{*standard-error*} take the format
from the value specified by \var{*default-external-format*}. The
default value of \var{*default-external-format*} is \kwd{iso8859-1}.
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
For files, \code{OPEN} takes the \kwd{external-format}
parameter to specify the format. The default external format is
\kwd{default}.
\subsection{Available External Formats}
The available external formats are listed below in
Table~\ref{table:external-formats}. The first column gives the
external format, and the second column gives a list of aliases that
can be used for this format. The set of aliases can be changed by
changing the \file{aliases} file.
For all of these formats, if an illegal sequence is encountered, no
error or warning is signaled. Instead, the offending sequence is
silently replaced with the Unicode REPLACEMENT CHARACTER (U$+$FFFD).
\begin{table}
\centering
\begin{tabular}{|l|l|p{3in}|}
\hline
\textbf{Format} & \textbf{Aliases} & \textbf{Description} \\
\hline
\hline
\kwd{iso8859-1} & \kwd{latin1} \kwd{latin-1} \kwd{iso-8859-1} & ISO8859-1 \\
\hline
\kwd{iso8859-2} & \kwd{latin2} \kwd{latin-2} \kwd{iso-8859-2} & ISO8859-2 \\
\hline
\kwd{iso8859-3} & \kwd{latin3} \kwd{latin-3} \kwd{iso-8859-3} & ISO8859-3 \\
\hline
\kwd{iso8859-4} & \kwd{latin4} \kwd{latin-4} \kwd{iso-8859-4} & ISO8859-4 \\
\hline
\kwd{iso8859-5} & \kwd{cyrillic} \kwd{iso-8859-5} & ISO8859-5 \\
\hline
\kwd{iso8859-6} & \kwd{arabic} \kwd{iso-8859-6} & ISO8859-6 \\
\hline
\kwd{iso8859-7} & \kwd{greek} \kwd{iso-8859-7} & ISO8859-7 \\
\hline
\kwd{iso8859-8} & \kwd{hebrew} \kwd{iso-8859-8} & ISO8859-8 \\
\hline
\kwd{iso8859-9} & \kwd{latin5} \kwd{latin-5} \kwd{iso-8859-9} & ISO8859-9 \\
\hline
\kwd{iso8859-10} & \kwd{latin6} \kwd{latin-6} \kwd{iso-8859-10} & ISO8859-10 \\
\hline
\kwd{iso8859-13} & \kwd{latin7} \kwd{latin-7} \kwd{iso-8859-13} & ISO8859-13 \\
\hline
\kwd{iso8859-14} & \kwd{latin8} \kwd{latin-8} \kwd{iso-8859-14} & ISO8859-14 \\
\hline
\kwd{iso8859-15} & \kwd{latin9} \kwd{latin-9} \kwd{iso-8859-15} & ISO8859-15 \\
\hline
\kwd{utf-8} & \kwd{utf} \kwd{utf8} & UTF-8 \\
\hline
\kwd{utf-16} & \kwd{utf16} & UTF-16 with optional BOM \\
\hline
\kwd{utf-16-be} & \kwd{utf-16be} \kwd{utf16-be} & UTF-16 big-endian (without BOM) \\
\hline
\kwd{utf-16-le} & \kwd{utf-16le} \kwd{utf16-le} & UTF-16 little-endian (without BOM) \\
\hline
\kwd{utf-32} & \kwd{utf32} & UTF-32 with optional BOM \\
\hline
\kwd{utf-32-be} & \kwd{utf-32be} \kwd{utf32-be} & UTF-32 big-endian (without BOM) \\
\hline
\kwd{utf-32-le} & \kwd{utf-32le} \kwd{utf32-le} & UTF-32 little-endian (without BOM) \\
\hline
\kwd{cp1250} & & \\
\hline
\kwd{cp1251} & & \\
\hline
\kwd{cp1252} & \kwd{windows-1252} \kwd{windows-cp1252} \kwd{windows-latin1} & \\
\hline
\kwd{cp1253} & & \\
\hline
\kwd{cp1254} & & \\
\hline
\kwd{cp1255} & & \\
\hline
\kwd{cp1256} & & \\
\hline
\kwd{cp1257} & & \\
\hline
\kwd{cp1258} & & \\
\hline
\kwd{koi8-r} & & \\
\hline
\kwd{mac-cyrillic} & & \\
\hline
\kwd{mac-greek} & & \\
\hline
\kwd{mac-icelandic} & & \\
\hline
\kwd{mac-latin2} & & \\
\hline
\kwd{mac-roman} & & \\
\hline
\kwd{mac-turkish} & & \\
\hline
\end{tabular}
\caption{External formats}
\label{table:external-formats}
\end{table}
\subsection{Composing External Formats}
A composing external format is an external format that converts between
one codepoint and another, rather than between codepoints and octets.
A composing external format must be used in conjunction with another
(octet-producing) external format. This is specified by
using a list as the external format. For example, we can use
\code{'(\kwd{latin1} \kwd{crlf})} as the external format. In this
particular example, the external format is latin1, but whenever a
carriage-return/linefeed sequence is read, it is converted to the Lisp
\lispchar{Newline} character. Conversely, whenever a string is written,
a Lisp \lispchar{Newline} character is converted to a
carriage-return/linefeed sequence. Without the \kwd{crlf} composing
format, the carriage-return and linefeed will be read in as separate
characters, and on output the Lisp \lispchar{Newline} character is
output as a single linefeed character.
Table~\ref{table:composing-formats} lists the available composing formats.
\begin{table}
\centering
\begin{tabular}{|l|l|p{3in}|}
\hline
\textbf{Format} & \textbf{Aliases} & \textbf{Description} \\
\hline
\hline
\kwd{crlf} & \kwd{dos} & Composing format for converting to/from DOS (CR/LF)
end-of-line sequence to \lispchar{Newline}\\
\kwd{cr} & \kwd{mac} & Composing format for converting to/from DOS (CR/LF)
end-of-line sequence to \lispchar{Newline}\\
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
\hline
\kwd{beta-gk} & & Composing format that translates (lower-case) Beta
code (an ASCII encoding of ancient Greek) \\
\hline
\kwd{final-sigma} & & Composing format that attempts to detect sigma in
word-final position and change it from U+3C3 to U+3C2\\
\hline
\end{tabular}
\caption{Composing external formats}
\label{table:composing-formats}
\end{table}
\section{Dictionary}
\subsection{Variables}
\begin{defvar}{extensions:}{default-external-format}
This is the default external format to use for all newly opened
files. It is also the default format to use for
\var{*standard-input*}, \var{*standard-output*}, and
\var{*standard-error*}. The default value is \kwd{iso8859-1}.
Setting this will cause the standard streams to start using the new
format immediately. If a stream has been created with external
format \kwd{default}, then setting \var{*default-external-format*}
will cause all subsequent input and output to use the new value of
\var{*default-external-format*}.
\end{defvar}
\subsection{Characters}
Remember that \cmucl{}'s characters are only 16-bits long but Unicode
codepoints are up to 21 bits long. Hence there are codepoints that
cannot be represented via Lisp characters. Operating on individual
characters is not recommended. Operations on strings are better.
(This would be true even if \cmucl{}'s characters could hold a
full Unicode codepoint.)
\begin{defun}{}{char-equal}{\amprest{} \var{characters}}
\defunx{char-not-equal}{\amprest{} \var{characters}}
\defunx{char-lessp}{\amprest{} \var{characters}}
\defunx{char-greaterp}{\amprest{} \var{characters}}
\defunx{char-not-greaterp}{\amprest{} \var{characters}}
\defunx{char-not-lessp}{\amprest{} \var{characters}}
For the comparison, the characters are converted to lowercase and
the corresponding \code{char-code} are compared.
\end{defun}
\begin{defun}{}{alpha-char-p}{\args \var{character}}
Returns non-nil{} if the Unicode category is a letter category.
\end{defun}
\begin{defun}{}{alphanumericp}{\args \var{character}}
Returns non-nil{} if the Unicode category is a letter category or an ASCII
digit.
\end{defun}
\begin{defun}{}{digit-char-p}{\args \var{character} \ampoptional{} \var{radix}}
Only recognizes ASCII digits (and ASCII letters if the radix is larger
than 10).
\end{defun}
\begin{defun}{}{graphic-char-p}{\args \var{character}}
Returns non-nil{} if the Unicode category is a graphic category.
\end{defun}
\begin{defun}{}{upper-case-p}{\args \var{character}}
\defunx{lower-case-p}{\args \var{character}}
Returns non-nil{} if the Unicode category is an uppercase
(lowercase) character.
\end{defun}
\begin{defun}{lisp:}{title-case-p}{\args \var{character}}
Returns non-nil{} if the Unicode category is a titlecase character.
\end{defun}
\begin{defun}{}{both-case-p}{\args \var{character}}
Returns non-nil{} if the Unicode category is an uppercase,
lowercase, or titlecase character.
\end{defun}
\begin{defun}{}{char-upcase}{\args \var{character}}
\defunx{char-downcase}{\args \var{character}}
The Unicode uppercase (lowercase) letter is returned.
\end{defun}
\begin{defun}{lisp:}{char-titlecase}{\args \var{character}}
The Unicode titlecase letter is returned.
\end{defun}
\begin{defun}{}{char-name}{\args \var{char}}
If possible the name of the character \var{char} is returned. If
there is a Unicode name, the Unicode name is returned, except
spaces are converted to underscores and the string is capitalized
via \code{string-capitalize}. If there is no Unicode name, the
form \lispchar{U+xxxx} is returned where ``xxxx'' is the
\code{char-code} of the character, in hexadecimal.
\end{defun}
\begin{defun}{}{name-char}{\args \var{name}}
The inverse to \code{char-name}. If no character has the name
\var{name}, then \nil{} is returned. Unicode names are not
case-sensitive, and spaces and underscores are optional.
\end{defun}
\subsection{Strings}
Strings in \cmucl{} are UTF-16 strings. That is, for Unicode code
points greater than 65535, surrogate pairs are used. We refer the
reader to the Unicode standard for more information about surrogate
pairs. We just want to make a note that because of the UTF-16
encoding of strings, there is a distinction between Lisp characters
and Unicode codepoints. The standard string operations know about
this encoding and handle the surrogate pairs correctly.
\begin{defun}{}{string-upcase}{\args \var{string} \keys{\kwd{start}
\kwd{end} \kwd{casing}}}
\defunx{string-downcase}{\args \var{string} \keys{\kwd{start}
\kwd{end} \kwd{casing}}}
\defunx{string-capitalize}{\args \var{string} \keys{\kwd{start}
\kwd{end} \kwd{casing}}}
The case of the \var{string} is changed appropriately. Surrogate
pairs are handled correctly. The conversion to the appropriate case
is done based on the Unicode conversion. The additional argument
\kwd{casing} controls how case conversion is done. The default
value is \kwd{simple}, which uses simple Unicode case conversion.
If \kwd{casing} is \kwd{full}, then full Unicode case conversion is
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
done where the string may actually increase in length.
\end{defun}
\begin{defun}{}{nstring-upcase}{\args \var{string} \keys{\kwd{start} \kwd{end}}}
\defunx{nstring-downcase}{\args \var{string} \keys{\kwd{start} \kwd{end}}}
\defunx{nstring-capitalize}{\args \var{string} \keys{\kwd{start}
\kwd{end}}}
The case of the \var{string} is changed appropriately. Surrogate
pairs are handled correctly. The conversion to the appropriate case
is done based on the Unicode conversion. (Full casing is not
available because the string length cannot be increased when needed.)
\end{defun}
\begin{defun}{}{string=}{\args \var{s1} \var{s2} \keys{\kwd{start1}
\kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string/=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string$<$}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string$>$}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string$<$=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string$>$=}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
The string comparison is done in codepoint order. (This is
different from just comparing the order of the individual characters
due to surrogate pairs.) Unicode collation is not done.
\end{defun}
\begin{defun}{}{string-equal}{\args \var{s1} \var{s2} \keys{\kwd{start1}
\kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string-not-equal}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string-lessp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string-greaterp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string-not-greaterp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
\defunx{string-not-lessp}{\args \var{s1} \var{s2} \keys{\kwd{start1} \kwd{end1} \kwd{start2} \kwd{end2}}}
Each codepoint in each string is converted to lowercase and the
appropriate comparison of the codepoint values is done. Unicode
collation is not done.
\end{defun}
\begin{defun}{}{string-left-trim}{\args \var{bag} \var{string}}
\defunx{string-right-trim}{\args \var{bag} \var{string}}
\defunx{string-trim}{\args \var{bag} \var{string}}
Removes any characters in \code{bag} from the left, right, or both
ends of the string \code{string}, respectively. This has potential
problems if you want to remove a surrogate character from the
string, since a single character cannot represent a surrogate. As
an extension, if \code{bag} is a string, we properly handle
surrogate characters in the \code{bag}.
\end{defun}
\subsection{Sequences}
Since strings are also sequences, the sequence functions can be used
on strings. We note here some issues with these functions. Most
issues are due to the fact that strings are UTF-16 strings and
characters are UTF-16 code units, not Unicode codepoints.
\begin{defun}{}{remove-duplicates}{\args \var{sequence}
\keys{\kwd{from-end} \kwd{test} \kwd{test-not} \kwd{start}
\kwd{end} \kwd{key}}}
\defunx{delete-duplicates}{\args \var{sequence}
\keys{\kwd{from-end} \kwd{test} \kwd{test-not} \kwd{start}
\kwd{end} \kwd{key}}}
Because of surrogate pairs these functions may remove a high or low
surrogate value, leaving the string in an invalid state. Use these
functions carefully with strings.
\end{defun}
\subsection{Reader}
To support Unicode characters, the reader has been extended to
recognize characters written in hexadecimal. Thus \lispchar{U+41} is
the ASCII capital letter ``A'', since 41 is the hexadecimal code for
that letter. The Unicode name of the character is also recognized,
except spaces in the name are replaced by underscores.
Recall, however, that characters in \cmucl{} are only 16 bits long so
many Unicode characters cannot be represented. However, strings can
represent all Unicode characters.
When symbols are read, the symbol name is converted to Unicode NFC
form before interning the symbol into the package. Hence,
\code{symbol-name (intern ``string'')} may produce a string that is
not \code{string=} to ``string''. However, after conversion to NFC
form, the strings will be identical.
\subsection{Printer}
When printing characters, if the character is a graphic character, the
character is printed. Thus \lispchar{U+41} is printed as
\lispchar{A}. If the character is not a graphic character, the Lisp
name (e.g., \lispchar{Tab}) is used if possible;
if there is no Lisp name, the Unicode name is used. If there is no
Unicode name, the hexadecimal char-code is
printed. For example, \lispchar{U+34e}, which is not a graphic
character, is printed as \lispchar{Combining\_Upwards\_Arrow\_Below},
and \lispchar{U+9f} which is not a graphic character and does not have a
Unicode name, is printed as \lispchar{U+009F}.
\subsection{Miscellaneous}
\subsubsection{Files}
\cmucl{} loads external formats using the search-list
\file{ext-formats:}. The \file{aliases} file is also located using
this search-list.
The Unicode data base is stored in compressed form in the file
\file{ext-formats:unidata.bin}. If this file is not found, Unicode
support is severely reduced; you can only use ASCII characters.
\begin{defun}{}{open}{\args \var{filename} \amprest \var{options}
\keys{\kwd{direction} \kwd{element-type} \kwd{if-exists}
\kwd{if-does-not-exist} \morekeys \kwd{class} \kwd{mapped}
\kwd{input-handle} \kwd{output-handle}
\yetmorekeys \kwd{external-format} \kwd{decoding-error}
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
The main options are covered elsewhere. Here we describe the
options specific to Unicode. The option \kwd{external-format}
specifies the external format to use for reading and writing the
file. The external format is a keyword.
The options \kwd{decoding-error} and \kwd{encoding-error} are used
to specify how encoding and decoding errors are handled. The
default value on \nil means the external format handles errors
itself and typically replaces invalid sequences with the Unicode
replacement character.
Otherwise, the value for \code{decoding-error} is either a
character, a symbol or a function. If a character is
specified. it is used as the replacement character for any invalid
decoding. If a symbol or a function is given, it must be a
function of three arguments: a message string to be printed, the
offending octet, and the number of octets read. If the function
returns, it should return two values: the code point to use as the
replacement character and the number of octets read. In addition,
\true{} may be specified. This indicates that a continuable error
is signaled, which, if continued, the Unicode replacement
character is used.
For \code{encoding-error}, a character, symbol, or function can be
specified, like \code{decoding-error}, with the same meaning. The
function, however, takes two arguments: a format message string
and the incorrect codepoint. If the function returns, it should
be the replacement codepoint.
\end{defun}
\begin{defun}{stream:}{set-system-external-format}{\var{terminal} \ampoptional{} \var{filenames}}
This function changes the external format used for
\var{*standard-input*}, \var{*standard-output*}, and
\var{*standard-error*} to the external format specified by
\var{terminal}. Additionally, the Unix file name encoding can be
set to the value specified by \var{filenames} if non-\nil.
\end{defun}
\begin{defun}{extensions:}{list-all-external-formats}{}
list all of the vailable external formats. A list is returned where
each element is a list of the external format name and a list of
aliases for the format. No distinction is made between external
formats and composing external formats.
\end{defun}
\begin{defun}{extensions:}{describe-external-format}{external-format}
Print a description of the given \var{external-format}. This may
cause the external format to be loaded (silently) if it is not
already loaded.
\end{defun}
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
Since strings are UTF-16 and hence may contain surrogate pairs, some
utility functions are provided to make access easier.
\begin{defun}{lisp:}{codepoint}{\args \var{string} \var{i}
\ampoptional{} \var{end}}
Return the codepoint value from \var{string} at position \var{i}.
If code unit at that position is a surrogate value, it is combined
with either the previous or following code unit (when possible) to
compute the codepoint. The first return value is the codepoint
itself. The second return value is \nil{} if the position is not a
surrogate pair. Otherwise, $+1$ or $-1$ is returned if the position
is the high (leading) or low (trailing) surrogate value, respectively.
This is useful for iterating through a string in codepoint sequence.
\end{defun}
\begin{defun}{lisp:}{surrogates-to-codepoint}{\args \var{hi} \var{lo}}
Convert the given \var{hi} and \var{lo} surrogate characters to the
corresponding codepoint value
\end{defun}
\begin{defun}{lisp:}{surrogates}{\args \var{codepoint}}
Convert the given \var{codepoint} value to the corresponding high
and low surrogate characters. If the codepoint is less than 65536,
the second value is \nil{} since the codepoint does not need to be
represented as a surrogate pair.
\end{defun}
\begin{defun}{stream:}{string-encode}{\args \var{string}
\var{external-format} \ampoptional{} (\var{start} 0) \var{end}}
\code{string-encode} encodes \var{string} using the format
\var{external-format}, producing an array of octets. Each octet is
converted to a character via \code{code-char} and the resulting
string is returned.
The optional argument \var{start}, defaulting to 0, specifies the
starting index and \var{end}, defaulting to the length of the
string, is the end of the string.
\end{defun}
\begin{defun}{stream:}{string-decode}{\args \var{string}
\var{external-format} \ampoptional{} (\var{start} 0) \var{end}}
\code{string-decode} decodes \var{string} using the format
\var{external-format} and produces a new string. Each character of
\var{string} is converted to octet (by \code{char-code}) and the
resulting array of octets is used by the external format to produce
a string. This is the inverse of \code{string-encode}.
The optional argument \var{start}, defaulting to 0, specifies the
starting index and \var{end}, defaulting to the length of the
string, is the end of the string.
\var{string} must consist of characters whose \code{char-code} is
less than 256.
\end{defun}
\begin{defun}{}{string-to-octets}{\args \var{string} \keys{\kwd{start}
\kwd{end} \kwd{external-format} \kwd{buffer} \kwd{buffer-start}
\kwd{error}}}
\code{string-to-octets} converts \var{string} to a sequence of
octets according to the external format specified by
\var{external-format}. The string to be converted is bounded by
\var{start}, which defaults to 0, and \var{end}, which defaults to
the length of the string. If \var{buffer} is specified, the octets
are placed in \var{buffer}. If \var{buffer} is not specified, a new
array is allocated to hold the octets. \var{buffer-start} specifies
where in the buffer the first octet will be placed.
An error method may also be specified by \var{error}. Any errors
encountered while converting the string to octets will be handled
according to error. If \nil{}, a replacement character is converted
to octets in place of the error. Otherwise, \var{error} should be a
symbol or function that will be called when the error occurs. The
function takes two arguments: an error string and the character
that caused the error. It should return a replacement character.
Three values are returned: The buffer, the number of valid octets
written, and the number of characters converted. Note that the
actual number of octets written may be greater than the returned
value, These represent the partial octets of the next character to
be converted, but there was not enough room to hold the complete set
of octets.
\end{defun}
\begin{defun}{}{octets-to-string}{\args \var{octets} \keys{\kwd{start}
\kwd{end} \kwd{external-format} \kwd{string} \kwd{s-start}
\kwd{s-end} \kwd{state}}}
\code{octets-to-string} converts the sequence of octets in
\var{octets} to a string. \var{octets} must be a
\code{(simple-array (unsigned-byte 8) (*))}. The octets to be
converted are bounded by \var{start} and \var{end}, which default to
0 and the length of the array, respectively. The conversion is
performed according to the external format specified by
\var{external-format}. If \var{string} is specified, the octets are
converted and stored in \var{string}, starting at \var{s-start}
(defaulting to 0) and ending just before \var{s-end} (defaulting to
the end of \var{string}. \var{string} must be \code{simple-string}.
If the bounded string is not large enough to hold all of the
characters, then some octets will not be converted. If \var{string}
is not specified, a new string is created.
The \var{state} is used as the initial state of for the external
format. This is useful when converting buffers of octets where the
buffers are not on character boundaries, and state information is
needed between buffers.
Four values are returned: the string, the number of characters
written to the string, and the number of octets consumed to produce
the characters, and the final state of external format after
converting the octets.
\subsection{External Formats}
Users may write their own external formats. It is probably easiest to
look at existing external formats to see how do this.
An external format basically needs two functions:
\code{octets-to-code} to convert octets to Unicode codepoints and
\code{code-to-octets} to convert Unicode codepoints to octets. The
external format is defined using the macro
\code{stream::define-external-format}.
\begin{defmac}[base]{stream:}{define-external-format}{\args \var{name}
(\keys{\var{base} \var{min} \var{max} \var{size}
\var{documentation}})
(\amprest{} \var{slots})
\morekeys{\var{octets-to-code} \var{code-to-octets}
\var{flush-state} \var{copy-state}}}
If \kwd{base} is not given, this defines a new external format of
the name \kwd{name}. \var{min}, \var{max}, and \var{size} are the
minimum and maximum number of octets that make up a character.
(\code{\kwd{size} n} is just a short cut for \code{\kwd{min} n
\kwd{max} n}.) The description of the external format can be
given using \kwd{documentation}. The arguments \var{octets-to-code}
and \var{code-to-octets} are not optional in this case. They
specify how to convert octets to codepoints and vice versa,
respectively. These should be backquoted forms for the body of a
function to do the conversion. See the description below for these
functions. Some good examples are the external format for
\kwd{utf-8} or \kwd{utf-16}. The \kwd{slots} argument is a list of
read-only slots, similar to defstruct. The slot names are available
as local variables inside the \var{code-to-octets} and
\var{octets-to-code} bodies.
If \kwd{base} is given, then an external format is defined with the
name \kwd{name} that is based on a previously defined format
\kwd{base}. The \var{slots} are inherited from the \kwd{base} format
by default, although the definition may alter their values and add
new slots. See, for example, the \kwd{mac-greek} external format.
\begin{defmac}{}{octets-to-code}{\args \var{state} \var{input}
\var{unput} \var{error} \amprest{} \var{args}}
This defines a form to be used by an external format to convert
octets to a code point. \var{state} is a form that can be used by
the body to access the state variable of a stream. This can be used
for any reason to hold anything needed by \code{octets-to-code}.
\var{input} is a form that returns one octet from the input stream.
\var{unput} will put back \var{N} octets to the stream. \var{args} is a
list of variables that need to be defined for any symbols in the
body of the macro.
\var{error} controls how errors are handled. If \nil, some suitable
replacement character is used. That is, any errors are silently
ignored and replaced by some replacement character. If non-\nil,
\var{error} is a symbol or function that is called to handle the
error. This function takes three arguments: a message string, the
invalid octet (or \nil), and a count of the number of octets that
have been read so far. If the function returns, it should be the
codepoint of the desired replacement character.
\begin{defmac}{}{code-to-octets}{\args \var{code} \var{state}
\var{output} \var{error} \amprest{} \var{args}}
Defines a form to be used by the external format to convert a code
point to octets for output. \var{code} is the code point to be
converted. \var{state} is a form to access the current value of the
stream's state variable. \var{output} is a form that writes one
octet to the output stream.
Similar to \code{octets-to-code}, \var{error} indicates how errors
should be handled. If \nil, some default replacement character is
substituted. If non-\nil, \var{error} should be a symbol or
function. This function takes two arguments: a message string and
the invalid codepoint. If the function returns, it should be the
codepoint that will be substituted for the invalid codepoint.
\begin{defmac}{}{flush-state}{\args \var{state}
\var{output} \var{error} \amprest{} \var{args}}
Defines a form to be used by the external format to flush out
any state when an output stream is closed. Similar to
\code{code-to-octets}, but there is no code point to be output. The
\var{error} argument indicates how to handle errors. If \nil, some
default replacement character is used. Otherwise, \var{error} is a
symbol or function that will be called with a message string and
codepoint of the offending state. If the function returns, it
should be the codepoint of a suitable replacement.
If \code{flush-state} is \false, then nothing special is needed to
flush the state to the output.
This is called only when an output character stream is being closed.
\end{defmac}
\begin{defmac}{}{copy-state}{\args \var{state} \amprest{} \var{args}}
Defines a form to copy any state needed by the external format.
This should probably be a deep copy so that if the original
state is modified, the copy is not.
If not given, then nothing special is needed to copy the state
either because there is no state for the external format or that no
special copier is needed.
\end{defmac}
\subsection{Composing External Formats}
\begin{defmac}{stream:}{define-composing-external-format}{\args \var{name}
(\keys{\var{min} \var{max} \var{size} \var{documentation}}) \var{input}