Slightly modify DECOMPOSE so it can operate on non-simple strings.
code/char.lisp: o Define CODEPOINT-LIMIT o Define CODEPOINT type code/extfmts.lisp code/string.lisp code/unidata.lisp pcl/simple-streams/external-formats/utf-32.lisp pcl/simple-streams/external-formats/utf-8.lisp o Use the CODEPOINT type in declarations.
code/seq.lisp: o Moved STRING-REVERSE* and STRING-NREVERSE* to string.lisp because we need to use WITH-STRING. code/string.lisp: o Fix STRING-REVERSE* and STRING-NREVERSE* which were not properly handling non-simple strings. The following tests were not returning "edcba": (let* ((x (make-array 10 :initial-contents "abcdefghij" :fill-pointer 5 :element-type 'base-char)) (y (reverse x))) y) (let* ((x (make-array 10 :initial-contents "abcdefghij" :fill-pointer 5 :element-type 'character)) (y (nreverse x))) y)
o Revert previous change to STRING-TO-NFC and STRING-TO-NFKC. o Use WITH-STRING in NORMALIZED-FORM-P so we operate on the underlying simple-string data.
NORMALIZED-FORM-P needs simple-strings. We should do this in a different way, but this will do for now.
code/string.lisp: o Add function (setf codepoint) o Add docstrings for STRING-TO-NFC and STRING-TO-NFKC. o Move things related to pairwise composition to unidata.lisp. code/unidata.lisp: o Things related to pairwise composition moved here. o Adjust *COMPOSITION-EXCLUSION* to include only the non-commented items in CompositionExclusions.txt. o Make BUILD-COMPOSITION-TABLE exclude characters that can be derived from the decomposition. (Basically, ignore the four decompositions of length greater than 1 that start with a non-zero combining class.)
Add support for Unicode NFC and NFKC forms. Implement STRING-TO-NFC and STRING-TO-NFKC. This probably needs some more work. The composition table should probably be a trie and should be in unidata.bin instead of the hash table that we use now. The composition exclusion list should probably be in unidata.bin too instead of here. These functions pass all of the normalization tests.
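For readers unfamiliar with the two forms, the difference can be illustrated outside CMUCL. The following is a minimal Python sketch using the standard `unicodedata` module; it only shows what NFC and NFKC produce for two well-known inputs and says nothing about how STRING-TO-NFC or STRING-TO-NFKC are implemented.

```python
import unicodedata

decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# NFC composes canonically equivalent sequences into precomposed characters.
nfc_e = unicodedata.normalize("NFC", decomposed)

# NFC leaves compatibility characters such as the ligature alone ...
nfc_fi = unicodedata.normalize("NFC", ligature)

# ... while NFKC also applies compatibility decompositions before composing.
nfkc_fi = unicodedata.normalize("NFKC", ligature)
```

Here `nfc_e` is the single character U+00E9, `nfc_fi` is still U+FB01, and `nfkc_fi` is the two-letter string "fi".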
Fix bug in DECOMPOSE which was no longer sorting the combining characters in combining-category order. We now pass the NFD and NFKD normalization tests again. (Fix from Paul)
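The ordering step mentioned in this entry is the canonical reordering required by NFD and NFKD: within each run of combining characters, marks are stably sorted by canonical combining class. A minimal Python sketch of that step follows. It is not the DECOMPOSE code; the function name `canonical_reorder` is hypothetical, and combining classes are looked up with the standard `unicodedata.combining`.

```python
import unicodedata

def canonical_reorder(s):
    """Stably sort each run of combining marks by canonical combining class."""
    chars = list(s)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]) == 0:
            i += 1          # starters (class 0) never move
            continue
        # Find the end of this run of non-starters.
        j = i
        while j < len(chars) and unicodedata.combining(chars[j]) != 0:
            j += 1
        # sorted() is stable, so marks with equal class keep their order.
        chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
        i = j
    return "".join(chars)

# U+0301 (acute, class 230) typed before U+0323 (dot below, class 220)
reordered = canonical_reorder("a\u0301\u0323")
```

After reordering, the dot below (class 220) precedes the acute accent (class 230), which matches what `unicodedata.normalize("NFD", ...)` returns for the same input.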
string.lisp: o Add SURROGATEP function to test if something is a surrogate value. extfmts.lisp: utf-16-be.lisp: utf-16-le.lisp: utf-16.lisp: utf-32-be.lisp: utf-32-le.lisp: utf-32.lisp: utf-8.lisp: o Use SURROGATEP.
Do case-insensitive comparison by converting to lower case instead of upper case. This is what Unicode CaseFolding.txt does. One example where it matters: U+1E9E maps to lower case U+DF, but the upper case version of U+DF is U+DF itself, so the two compare equal only when converting to lower case. char.lisp: o Change EQUAL-CHAR-CODE to convert to lowercase. string.lisp: o Change EQUAL-CHAR-CODEPOINT to convert to lowercase. o Fix mistake in STRING-LESS-GREATER-EQUAL which was incorrectly comparing the codepoints instead of the equal-char-codepoint values.
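The U+1E9E example can be reproduced in Python as a rough illustration. The helper names below are hypothetical, and note one difference from the entry above: Python's `str.upper` applies full case mappings, so U+DF becomes "SS" rather than staying U+DF. The outcome is the same either way, since upper-casing does not make the two characters equal while lower-casing does.

```python
def equal_by_lower(a, b):
    """Case-insensitive comparison by converting both sides to lower case."""
    return a.lower() == b.lower()

def equal_by_upper(a, b):
    """Case-insensitive comparison by converting both sides to upper case."""
    return a.upper() == b.upper()

capital_sharp_s = "\u1e9e"   # LATIN CAPITAL LETTER SHARP S
small_sharp_s = "\u00df"     # LATIN SMALL LETTER SHARP S

lower_match = equal_by_lower(capital_sharp_s, small_sharp_s)
upper_match = equal_by_upper(capital_sharp_s, small_sharp_s)
```

With these definitions `lower_match` is true and `upper_match` is false.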
Add UTF16-STRING-P to determine if a string is a valid UTF-16 encoded string.
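The log does not say exactly which conditions UTF16-STRING-P checks. As an assumption, the usual definition of well-formed UTF-16 is that every high surrogate is immediately followed by a low surrogate and no low surrogate appears on its own. A minimal Python sketch of that check over a list of 16-bit code units, with the hypothetical name `utf16_valid`:

```python
def utf16_valid(units):
    """Return True if the list of 16-bit code units is well-formed UTF-16."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate without a preceding high surrogate is invalid.
            return False
        else:
            i += 1
    return True
```

For example, `[0x61, 0xD834, 0xDD1E]` is accepted, while a lone `[0xD834]` or a reversed pair `[0xDD1E, 0xD834]` is rejected.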
STRING-LESS-GREATER-EQUAL handles codepoints so STRING-LESSP and friends now sort in codepoint order (after converting to uppercase).
o Lots of spelling fixes from Paul. o Add unicode codepoints in final-sigma.lisp (in case the characters there don't show up correctly). o Support partial-fill in READ-INTO-STRING.
Simple docstrings for STRING-TO-NFD and STRING-TO-NFKD.
From Paul: Package and symbol names in Unicode need to be in a canonical normalization form (eventually...when NFC is implemented)
From Paul. Use CODEPOINT in %GLYPH-B.
Updates from Paul. o Use CODEPOINT instead of XCHAR in %GLYPH-F o Simplify DECOMPOSE
Updates from Paul. With these changes, we pass the Unicode normalization test suite successfully for NFD and NFKD. unidata.lisp: o Implement algorithmic decomposition of Hangul. string.lisp: o Implement Unicode normalization forms NFD and NFKD.
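Algorithmic Hangul decomposition refers to the arithmetic given in the Unicode Standard, which derives the jamo of a precomposed syllable from its code point instead of storing a table entry for each of the 11172 syllables. A minimal Python sketch of that arithmetic follows; the function name `decompose_hangul` is hypothetical and this is not the unidata.lisp code.

```python
# Constants from the Unicode Standard's Hangul syllable decomposition.
SBASE, LBASE, VBASE, TBASE = 0xAC00, 0x1100, 0x1161, 0x11A7
LCOUNT, VCOUNT, TCOUNT = 19, 21, 28
NCOUNT = VCOUNT * TCOUNT      # 588
SCOUNT = LCOUNT * NCOUNT      # 11172

def decompose_hangul(cp):
    """Return the list of jamo code points for a precomposed Hangul syllable.

    Code points outside the syllable block are returned unchanged.
    """
    s_index = cp - SBASE
    if not 0 <= s_index < SCOUNT:
        return [cp]
    l = LBASE + s_index // NCOUNT
    v = VBASE + (s_index % NCOUNT) // TCOUNT
    t = TBASE + s_index % TCOUNT
    # t == TBASE means the syllable has no trailing consonant.
    return [l, v] if t == TBASE else [l, v, t]
```

For example, U+D4DB decomposes to U+1111 U+1171 U+11B6, and U+AC00 decomposes to U+1100 U+1161.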
string.lisp: o Add Paul's SURROGATES-TO-CODEPOINT and remove CODEPOINT-FROM-SURROGATES. o Change SURROGATES to return characters, not numbers. o Update callers of SURROGATES to match. extfmts.lisp: o Update callers of SURROGATES to match. o Use CODEPOINT to extract the correct codepoint from a string in EF-STRING-TO-OCTETS and EF-OCTETS-TO-STRING.
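The conversion between a surrogate pair and a code point is fixed by the UTF-16 encoding. A minimal Python sketch of both directions follows. The function names echo the ones in this entry for readability only: these versions work on integers, whereas the entry says SURROGATES returns characters, and the real signatures are not shown in the log.

```python
def surrogates_to_codepoint(hi, lo):
    """Combine a high and a low surrogate (as integers) into a code point."""
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)

def surrogates(codepoint):
    """Split a code point into (high, low); low is None inside the BMP."""
    if codepoint < 0x10000:
        return (codepoint, None)
    v = codepoint - 0x10000
    return (0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF))

# U+1D11E MUSICAL SYMBOL G CLEF is encoded as D834 DD1E in UTF-16.
pair = surrogates(0x1D11E)
roundtrip = surrogates_to_codepoint(*pair)
```

Here `pair` is `(0xD834, 0xDD1E)` and `roundtrip` is `0x1D11E` again.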
o Add new function CODEPOINT-FROM-SURROGATES to compute the codepoint from two surrogate values. (Should we use a better name?) o Use the new function in CODEPOINT. o Add docstrings to the functions.
code/string.lisp: o From Paul: - Handle the ASCII special casing in string.lisp instead of unidata.lisp - Add utility functions CODEPOINT and SURROGATES. code/unidata.lisp: o Remove the ASCII special cases from UNICODE-LOWER, UNICODE-UPPER, UNICODE-TITLE.
NSTRING-UPCASE and NSTRING-DOWNCASE were referencing the unknown symbols NEWSTRING and NEW-INDEX. Replace with STRING and INDEX, respectively. I think that's what was intended.
From Paul: Here's a version of [n]string-(up|down)case that handles non-BMP characters. Also added functionless stubs for normalization forms. Improved string-reverse* and implemented string-nreverse* in a way that shouldn't cons (not the original way I worked out, which might be faster but is quite complicated). (The glyph builder now stops when it hits a combining character that's out of sequence (canonical order)---I'm not sure whether or not that's the Right Thing to do)
More updates from Paul. code/seq.lisp: o Update SEQ-DISPATCH to allow a special dispatch form for strings. o Implement STRING-REVERSE* that correctly handles our UTF-16 strings. o Implement STRING-NREVERSE*, but this needs work to reduce consing. code/string.lisp: o Add GLYPH and SGLYPH to return the glyph from a position in a string. code/exports.lisp: o Export GLYPH and SGLYPH
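Reversing a UTF-16 string code unit by code unit would swap the two halves of every surrogate pair. One way to avoid that, sketched below in Python, is to reverse all the code units and then swap back any low/high pair left in the wrong order. This is an illustrative assumption about the approach, not the STRING-REVERSE* code, and the name `reverse_utf16` is hypothetical.

```python
def reverse_utf16(units):
    """Reverse a list of UTF-16 code units, keeping surrogate pairs intact."""
    out = list(reversed(units))
    i = 0
    while i + 1 < len(out):
        # After a plain reversal, a valid pair appears as (low, high).
        if 0xDC00 <= out[i] <= 0xDFFF and 0xD800 <= out[i + 1] <= 0xDBFF:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return out

# "a", U+1D11E (as D834 DD1E), "b"
reversed_units = reverse_utf16([0x61, 0xD834, 0xDD1E, 0x62])
```

The result is `[0x62, 0xD834, 0xDD1E, 0x61]`: the characters are reversed, but the surrogate pair still reads high then low.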
Updates from Paul: add numeric values and decompositions to unidata, added char-titlecase, and made string-capitalize use title-case rather than upper-case, when those are different. The unidata.bin file needs to be rebuilt, and a cross-compile needs to be done to support the new unidata.bin format.
Merge from unicode-utf16 branch, label unicode-utf16-char-support-2009-03-25 to get character support.
Instead of ignoring the :element-type argument to MAKE-STRING, we check that it's a valid subtype of character (then ignore it).
From Eric Marsden: Fix some error types to be ANSI compliant.
A few well-placed inhibit-warnings declarations to suppress noise in compile-lisp.log. Only 46/130 notes left.
ANSI CL compat. changes: o Add an optional environment argument to constantp; ignored by CMUCL. o Add the :element-type keyword to make-string.
Merged DTC's patch to string<>=*-body which fixes various problems that arose when :start2 :end2 values were specified.
Fix header boilerplate.
Removed an extra ``)''.
Changed STRING-xxxCASE to not assign arguments.
Changed the WITH-xxx-STRINGs macros to use simply WITH-ARRAY-DATA, now that it is more clever. Also, changed it to accept any STRINGable thing, instead of just strings and symbols. These macros now bind the offset var instead of randomly setting it.
New file header with RCS header FILE-COMMENT.
Moved MIPS branch onto trunk; no merge necessary.