Update to Unicode 6.0.0. code/unidata.lisp: o Update unicode version to 6.0.0 o Add pointer to build-unidata.lisp. tools/build-unidata.lisp: o Update unicode version to 6.0.0 o Print out directory path so we can see where we're getting the data from. i18n/CaseFolding.txt i18n/CompositionExclusions.txt i18n/DerivedNormalizationProps.txt i18n/NameAliases.txt i18n/NormalizationCorrections.txt i18n/SpecialCasing.txt i18n/UnicodeData.txt i18n/WordBreakProperty.txt i18n/tests/NormalizationTest.txt i18n/tests/WordBreakTest.txt: o Update with new files from unicode.org.
Add function to load all unicode data into memory. This makes it easy to make an executable image that doesn't need unidata.bin around. (Should we do this for normal cores? It seems to add about 1 MB to the core size.) code/unidata.lisp: o Add LOAD-ALL-UNICODE-DATA to load all unicode data. o Add UNICODE-DATA-LOADED-P to check that unicode data has been loaded. code/print.lisp: o If unicode data is loaded, don't check for existence of *unidata-path*, because we don't need it. code/exports.lisp: o Export LOAD-ALL-UNICODE-DATA. general-info/release-20c.txt: o Update info
Add -unidata option to specify unidata.bin file. This change requires a cross-compile. Use boot-2011-04-01-cross.lisp as the cross-compile script. bootfiles/20b/boot-2011-04-01-cross.lisp: o New cross-compile bootstrap file lisp/lisp.c: o Recognize -unidata option and set up *UNIDATA-PATH* appropriately. code/commandline.lisp: o Add defswitch for unidata so we don't get complaints about unknown switch. code/unidata.lisp: o Rename +UNIDATA-PATH+ to *UNIDATA-PATH*, since it's not a constant anymore. o Update code to use new name. code/print.lisp: o Update code to use *UNIDATA-PATH* compiler/sparc/parms.lisp: o Add *UNIDATA-PATH* to list of static symbols. o Add back in spare-9 and spare-8 static symbols since we need to do a cross-compile for this change anyway. compiler/x86/parms.lisp: o Add *UNIDATA-PATH* to list of static symbols. o Reorder the static symbols in a more logical arrangement so that the spare symbols are at the end. i18n/local/cmucl.pot: o Update
Remove extra right parenthesis.
Fix bug where cmucl was no longer recognizing things like #\latin_small_letter_a. This failure is caused by the new SEARCH-DICTIONARY function that does partial completion; the UNICODE-NAME-TO-CODEPOINT function wasn't aware of the new way. We could change UNICODE-NAME-TO-CODEPOINT to do the appropriate thing with the new way, but I (rtoy) decided it would be nice to have the old function around too. Hence, restore the old version and use it.
Add a function to create the key from two codepoints that can be used as the key for the composition table. That way the logic is in exactly one place and not spread out through the code.
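The idea of a single key-building function can be sketched as follows. This is an illustrative Python sketch, not the CMUCL Lisp code: the name `composition_key` and the toy table are hypothetical. The 21-bit shift follows the older Unicode 5.2 revision in this log, which notes that keys are built from full codepoints; 21 bits suffice because every codepoint is below #x110000.

```python
CODEPOINT_BITS = 21       # every Unicode codepoint is < 0x110000 < 2**21
CODEPOINT_LIMIT = 0x110000


def composition_key(cp1: int, cp2: int) -> int:
    """Pack two codepoints into one integer usable as a table key.

    Hypothetical helper: keeping the packing in one function means the
    shift width is defined in exactly one place.
    """
    for cp in (cp1, cp2):
        if not 0 <= cp < CODEPOINT_LIMIT:
            raise ValueError(f"not a codepoint: {cp!r}")
    return (cp1 << CODEPOINT_BITS) | cp2


# Toy pairwise composition table: A + COMBINING GRAVE ACCENT -> A WITH GRAVE.
table = {composition_key(0x41, 0x300): 0xC0}
```

Both the code that builds the table and the code that looks pairs up in it would call the same helper, so a change in key layout cannot leave the two out of sync.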
When there's more than one possible completion, we need to keep the original completions along with the extensions.
Was mishandling the case where there are no more completions. In this case we were returning the prefix string, but that would be incorrect if the prefix string is not a valid character. So check that it is valid and return it. Otherwise do nothing (thereby returning nil) so slime can note the character is invalid.
Improve completion of Hangul syllables and CJK unified ideographs some more and fix some bugs in previous change.
o Move %STR, %STRX and %MATCH around so that we can inline them (because they're so simple). o Add some comments for %STR. o Change implementation of %MATCH to be simpler and add comments on why we do what we do and explain what happens if we don't. o Handle completion of Hangul syllables better: - Match "Hangul_S" instead of "Hangul_Syllable" because there's #\Hangul_Single_Dot_Tone_Mark. - If we match "Hangul_S", try to complete some Hangul syllables so we don't fool slime into thinking "Hangul_Syllable_" is the only completion. There are obviously more. o Handle completion of CJK Unified Ideographs better by trying to complete more so slime isn't fooled into thinking "CJK_Unified_Ideograph-" is the only possible completion.
o Construction of the Hangul syllable codebook was wrong. To satisfy the constraints on the codebook, we just sort them in decreasing order of length. o In %MIP, it might happen that MISMATCH returns NIL, which means a match. In this case, don't change the position.
Some Hangul syllables were left out of the Hangul syllable dictionary. Redo this by looping over all codepoints and selecting the codepoints that are Hangul syllables.
code/unidata.lisp: o Update constants to Unicode version 5.2.0. i18n/unidata.bin: o Regenerated using Unicode version 5.2.0.
code/unidata.lisp: o Just add some comments on why we don't put the dictionaries in unidata.bin. o Print out some messages when building the hangul and cjk dictionaries so the user knows what's happening. tools/build-unidata.lisp: o Add some comments on the various parts of unidata.bin.
exports.lisp: o Export STRING-TO-NFC, UNICODE-COMPLETE, and UNICODE-COMPLETE-NAME. unidata.lisp: o Add explicit exports.
Optimize the completion of the Hangul syllables and the CJK unified ideographs by using dictionaries. (Should these dictionaries be part of unidata.bin so they don't have to be built at run time? On the one hand, it makes things simpler, but it unnecessarily bloats unidata.bin. I suspect the hangul syllables and cjk ideograph characters are not used very often.) o Change NODE-NEXT and CLOSE-NODE to have an optional parameter for the dictionary to use. o Update UNICODE-COMPLETE-NAME to pass the dictionary to NODE-NEXT and CLOSE-NODE. o Update UNICODE-COMPLETE to use the hangul syllable dictionary and the cjk ideograph dictionary when searching. o Fix typo in UNICODE-COMPLETE. o Add defvars for dictionaries for hangul syllables and cjk ideographs. o Add functions to build the hangul and cjk dictionaries. o Steal the implementations of BUILD-DICTIONARY, NAME-LOOKUP, and ENCODE-NAME from tools/build-unidata.lisp.
Add support for character completion. This is primarily intended to support character completion for slime. The implementation is from Paul Foley, with some slight modifications by Raymond Toy to handle a few corner cases. o Modify SEARCH-DICTIONARY to take optional current and posn parameters so that SEARCH-DICTIONARY can be started from a different place. o Add UNICODE-COMPLETE, which is the main function for character name completion. o Add other support functions for UNICODE-COMPLETE.
o Fix typo in UNICODE-DECOMP. (It's hangul-syllable-p, not hangule-syllable-p.) o Move the computation of *reverse-hangul-choseong*, *reverse-hangul-jungseong*, and *reverse-hangul-jongseong* to its own routine. Call it in UNICODE-NAME-TO-CODEPOINT.
Pull out the range tests for CJK Ideographs and Hangul Syllables and put the tests into their own functions so that the limits are in one place.
Add support for Unicode 5.2. The normalization and wordbreak tests pass. code/string.lisp: o In %compose, handle the case where the composite character is outside the BMP and thus needs special handling for our UTF-16 strings. code/unidata.lisp: o CJK Ideograph range has changed in 5.2. o Fix bug in build-composition-table. We were not correctly handling the case where the decomposition of a codepoint was outside the BMP. Special care is needed to handle the UTF-16 strings that we use. o The keys for the pairwise composition table are the full codepoints, so we need to shift one by 21 bits instead of 16. tools/build-unidata.lisp: o Update minor version to 2. i18n/BidiMirroring.txt i18n/CaseFolding.txt i18n/CompositionExclusions.txt i18n/DerivedNormalizationProps.txt i18n/NameAliases.txt i18n/NormalizationCorrections.txt i18n/SpecialCasing.txt i18n/UnicodeData.txt i18n/WordBreakProperty.txt i18n/tests/NormalizationTest.txt i18n/tests/WordBreakTest.txt: o Updated from Unicode 5.2. i18n/unidata.bin: o Regenerated from new Unicode 5.2 files.
UNICODE-NAME-TO-CODEPOINT was incorrectly accepting any value after #\cjk_unified_ideograph-nnnn and returning the character whose code was nnnn. This is wrong. o Add a new function to check for valid ranges for CJK unified ideographs. o Use it in UNICODE-NAME-TO-CODEPOINT and UNICODE-NAME.
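The kind of check described here might look roughly like the sketch below. It is Python for illustration only, not the CMUCL implementation. The function names are hypothetical, and the ranges are an assumption: they are the CJK Unified Ideograph ranges listed in Unicode 5.2's UnicodeData.txt, which may not match what the Lisp code used at this revision.

```python
import re

# Assumption: CJK Unified Ideograph ranges as of Unicode 5.2
# (Extension A, the main block, Extension B, Extension C).
CJK_RANGES = (
    (0x3400, 0x4DB5),
    (0x4E00, 0x9FCB),
    (0x20000, 0x2A6D6),
    (0x2A700, 0x2B734),
)

_CJK_NAME = re.compile(r"cjk_unified_ideograph-([0-9a-f]{4,5})", re.IGNORECASE)


def cjk_ideograph_p(cp: int) -> bool:
    """True if cp lies in one of the assumed CJK unified ideograph ranges."""
    return any(lo <= cp <= hi for lo, hi in CJK_RANGES)


def cjk_name_to_codepoint(name: str):
    """Return the codepoint for a cjk_unified_ideograph-NNNN name, else None."""
    m = _CJK_NAME.fullmatch(name)
    if m is None:
        return None                      # trailing junk or malformed hex
    cp = int(m.group(1), 16)
    return cp if cjk_ideograph_p(cp) else None  # reject out-of-range values
```

With this shape, a name such as `cjk_unified_ideograph-0041` is rejected instead of silently mapping to the character with code #x41.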
Change uses of _"foo" to (intl:gettext "foo"). This is because slime may get confused with source locations if the reader macros are installed.
Merge intl-branch 2010-03-18 to HEAD. To build, you need to use boot-2010-02-1 as the bootstrap file. You should probably also use the new -P option for build.sh to generate and update the po files while building.
Restart internationalization work. This new branch starts with code from the intl-branch on date 2010-02-12 18:00:00+0500. This version works and LANG=en@piglatin bin/lisp works (once the piglatin translation is added).
Mark translatable strings; update cmucl.pot and ko/cmucl.po accordingly.
Add (intl:textdomain "cmucl") to the files to set the textdomain.
tools/build-unidata.lisp: o Add support for word break properties. o Some cleanup of the code including moving the common code in write-ntrie* to write-ntrie. code/unidata.lisp: o Add support for word break properties. o UNICODE-WORD-BREAK-CODE and UNICODE-WORD-BREAK return the property code and the property keyword for a codepoint, respectively. i18n/WordBreakProperty.txt: o New file for the word break properties.
unidata.lisp: o Add *unidata-version* to hold our revision number. save.lisp: o Add Unicode to the herald items. Just print out the unidata version along with the supported Unicode UCD version.
boot-2009-07.lisp: o Bootstrap file needed to compile this change (because the current shrink-vector derive-type optimizer didn't handle union types). compiler/fndb.lisp: o Make the compiler warn if the result of lisp::shrink-vector is not used. This is a problem because the compiler doesn't know that shrink-vector destructively modifies the length of a vector. As a partial solution, warn the user if the result of shrink-vector is not used. code/hash-new.lisp: code/seq.lisp: o Make sure the result of shrink-vector is used, to get rid of a new compiler warning. code/unidata.lisp: o Modify %unicode-full-case so that it doesn't use shrink-vector anymore. compiler/seqtran.lisp: o Fix shrink-vector derive-type optimizer to handle union types. tools/build-unidata.lisp: o Fix typo that someone got in. o Make sure the result of shrink-vector is used, to get rid of a new compiler warning.
code/string.lisp: o Only define STRING-TO-NFD, STRING-TO-NFKD, and STRING-TO-NFKC for Unicode builds. Conditionalize out their support functions too. o Update export list to be conditional on Unicode too. o Use new name for get-pairwise-composition. code/exports.lisp: o Update export list to be conditional on Unicode for above changes in string.lisp. code/unidata.lisp: o Change name from GET-PAIRWISE-COMPOSITION to UNICODE-PAIRWISE-COMPOSITION to match other Unicode function names.
Merge Unicode work to trunk. From label unicode-utf16-extfmt-2009-06-11.
Add link to Hangul composition sample code.
Add CaseFolding.txt to unidata.bin so we can do case-insensitive comparisons according to Unicode. i18n/CaseFolding.txt: o New file code/unidata.lisp o Add new slots to the unidata structure to hold the simple and full case-folding information. o Add UNICODE-CASE-FOLD-SIMPLE and UNICODE-CASE-FOLD-FULL functions to return the case-folded codepoint or string for the simple and full options, respectively. tools/build-unidata.lisp: o Add new slots to the unidata structure and the ucdent structure to hold the case folding information from CaseFolding.txt. o Update routines to read the case folding data and to write the data to unidata.bin. o Speed optimization: Use a hash table whose key is the codepoint and whose value is the index into the vector. This preserves the structure of the code but vastly improves the speed of reading and processing the unicode data files, especially for the derived normalization properties. (We should just replace the vector with the hash table.)
Oops. Forgot to call the default converter in %UNICODE-FULL-CASE and forgot to return the string.
tools/build-unidata.lisp: o Add support for reading SpecialCasing.txt to support full-casing operation. (Currently does not support language-specific cases or context dependent cases.) o Update some prints o Add check to write-unidata to produce an error if we try to write more objects than we have allocated space for in the index table. code/unidata.lisp: o Support loading the full case tables o Add functions to produce the full case string for a codepoint.
code/unidata.lisp: o Add UNICODE-ASSIGNED-CODEPOINT-P code/string.lisp: o Make UTF16-STRING-P check for unassigned codepoints in the string.
tools/build-unidata.lisp: o Read composition exclusions from the composition exclusions files and save it in unidata.bin. code/unidata.lisp: o Read composition exclusions from unidata.bin o Use the exclusions from unidata.bin instead of using the hand-initialized list. i18n/unidata.bin: o Updated with composition exclusions list.
Remove (debug 0) so DESCRIBE can give better information about these functions. (Should we adjust safety and space too? We probably don't need unsafe code everywhere.)
code/char.lisp: o Define CODEPOINT-LIMIT o Define CODEPOINT type code/extfmts.lisp code/string.lisp code/unidata.lisp pcl/simple-streams/external-formats/utf-32.lisp pcl/simple-streams/external-formats/utf-8.lisp: o Use the CODEPOINT type in declarations.
code/string.lisp: o Add function (setf codepoint) o Add docstrings for STRING-TO-NFC and STRING-TO-NFKC. o Move things related to pairwise composition to unidata.lisp. code/unidata.lisp: o Things related to pairwise composition moved here. o Adjust *COMPOSITION-EXCLUSION* to include only the non-commented items in CompositionExclusions.txt. o Make BUILD-COMPOSITION-TABLE exclude characters that can be derived from the decomposition. (Basically, ignore the four decompositions of length greater than 1 that start with a non-zero combining class.)
Add support for quick check normalization properties. (From Paul.) i18n/DerivedNormalizationProps.txt: o New file containing the normalization data we need. tools/build-unidata.lisp: o Read the normalization properties and build unidata.bin to include four new tries, one each for NFC/NFKC/NFD/NFKD. o Add new 1 and 2 bit tries. code/unidata.lisp: o Read the new data o Add new functions to return the quick check normalization data. code/stream-vector-io.lisp: code/stream.lisp: o Add support for 1, 2, and 4 bit vectors for stream I/O.
The Unicode 1.0 names were being stored in the wrong slots of *unicode-data*, overwriting the Unicode names. Put them in the right slots.
o Add constants for the magic number and the (expected) Unicode major, minor, and upgrade version to make the code slightly easier to read.
Use TRUNCATE instead of FLOOR. (Works around an issue with type derivation and the maybe-inlined FLOOR function.)
From Paul. Support Unicode names for Hangul syllables and the CJK ideographs. These names can all be computed from the codepoint.
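As an illustration of how such names can be computed, here is a Python sketch (not the CMUCL Lisp code) of the Hangul syllable name rule from the Unicode Standard's conjoining jamo section. The function name `hangul_syllable_name` is hypothetical. CJK unified ideograph names are simpler: they are the fixed prefix followed by the codepoint in hexadecimal.

```python
S_BASE = 0xAC00                        # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT            # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT            # 11172 syllables in total

# Jamo short names, indexed by leading consonant, vowel, trailing consonant.
JAMO_L = "G GG N D DD R M B BB S SS  J JJ C K T P H".split(" ")
JAMO_V = ("A AE YA YAE EO E YEO YE O WA WAE OE YO U WEO WE WI "
          "YU EU YI I").split(" ")
JAMO_T = (" G GG GS N NJ NH D L LG LM LB LS LT LP LH M B BS S SS "
          "NG J C K T P H").split(" ")


def hangul_syllable_name(cp: int):
    """Return the Unicode name of a Hangul syllable, or None otherwise."""
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return None
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return ("HANGUL SYLLABLE "
            + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])
```

Because the name is a pure function of the codepoint, no per-character name needs to be stored in the data file for these ranges.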
Updates from Paul. With these changes, we pass the Unicode normalization test suite successfully for NFD and NFKD. unidata.lisp: o Implement algorithmic decomposition of Hangul. string.lisp: o Implement Unicode normalization forms NFD and NFKD.
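The algorithmic decomposition of Hangul is arithmetic on the syllable index, as defined in the Unicode Standard. The following is a Python sketch for illustration, not the CMUCL code; the name `decompose_hangul` and the tuple return value are assumptions made for the example.

```python
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT            # 588
S_COUNT = L_COUNT * N_COUNT            # 11172


def decompose_hangul(cp: int):
    """Return the canonical jamo decomposition of cp as a tuple.

    Codepoints that are not precomposed Hangul syllables are returned
    unchanged as a one-element tuple.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        return (cp,)
    l = L_BASE + s_index // N_COUNT                # leading consonant
    v = V_BASE + (s_index % N_COUNT) // T_COUNT    # vowel
    t = T_BASE + s_index % T_COUNT                 # trailing consonant
    # T_BASE itself means "no trailing consonant" and is not emitted.
    return (l, v) if t == T_BASE else (l, v, t)
```

For example, U+D55C decomposes to U+1112, U+1161, U+11AB, while U+AC00 has no trailing consonant and decomposes to just U+1100, U+1161.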
Updates from Paul: o Fix some typos in comments. o Change UNICODE-DECOMP to use T to get compatibility decompositions. o Fix error in returning the compatibility decompositions.
From Paul. Make the decomp structure store strings instead of an array of 16-bit integers, since the space is the same.
code/string.lisp: o From Paul: - Handle the ASCII special casing in string.lisp instead of unidata.lisp - Add utility functions CODEPOINT and SURROGATES. code/unidata.lisp: o Remove the ASCII special cases from UNICODE-LOWER, UNICODE-UPPER, UNICODE-TITLE.
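The log does not give the signatures of CODEPOINT and SURROGATES, so the following Python sketch only illustrates the standard UTF-16 arithmetic that such utilities rely on. The function names, argument order, and return conventions here are assumptions, not the CMUCL interface.

```python
def surrogates(cp: int):
    """Split a codepoint into UTF-16 code units.

    Returns (unit, None) for BMP codepoints and (high, low) otherwise.
    """
    if cp < 0x10000:
        return cp, None
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def codepoint(units, i: int = 0) -> int:
    """Read the codepoint starting at index i of a UTF-16 code unit sequence.

    An unpaired surrogate is returned as-is.
    """
    hi = units[i]
    if 0xD800 <= hi <= 0xDBFF and i + 1 < len(units):
        lo = units[i + 1]
        if 0xDC00 <= lo <= 0xDFFF:
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    return hi
```

The two functions are inverses for any codepoint outside the BMP, which is the property string code needs when characters are stored as 16-bit units.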
For UNICODE-LOWER, UNICODE-UPPER, and UNICODE-TITLE, add special case to handle ASCII without loading unidata.bin. This handles the issue of these functions getting called early in the init process before unicode is set up, in, for example, STRING-DOWNCASE, which is called when setting up search lists.
Bug fix from Paul: Some unnamed chars printed as "#\" because this [unicode-name+] was returning an empty string instead of NIL.
More updates from Paul: changes the order [of the unicode categories], which fixes some bugs, too. Need to rebuild unidata.bin once more.
UNICODE-NAME-TO-CODEPOINT and UNICODE-1.0-NAME-TO-CODEPOINT need to return NIL if the name can't be found. This fixes the issue where (NAME-CHAR "a") didn't return NIL since "a" isn't a name. (From Paul)
Support printing symbols with Unicode letters that have no case like Hangul. code/unidata.lisp: o Add +UNICODE-CATEGORY-OTHER+ to represent Unicode category Lo. code/print.lisp: o New attribute OTHERCASE-ATTRIBUTE for Unicode category Lo. o Update ATTRIBUTE-NAMES to include new attribute. o In SYMBOL-QUOTEP, adjust letter-attribute to include othercase-attribute as appropriate. o In REINIT-CHAR-ATTRIBUTES, initialize character attributes to include Unicode characters with category Lo.
o Fix bug in LOAD-SCASE: the ntrie is 32 bits, not 16. o Add constants for upper and lower case categories. (Primarily for use in char.lisp, so we don't ever have to modify char.lisp for this.)
Document the dictionary and ntrie structures. (From Paul Foley.)
Another update from Paul: added combining class, bidi info, and Unicode 1.0 names - that's everything from the base UnicodeData.txt (and a few additions). New files: BidiMirroring.txt and NormalizationCorrections.txt Updated unidata.bin too.
Updates from Paul: add numeric values and decompositions to unidata, added char-titlecase, and made string-capitalize use title-case rather than upper-case, when those are different. The unidata.bin file needs to be rebuilt, and a cross-compile needs to be done to support the new unidata.bin format.
New implementation of the unidata structures from Paul. He says he changed the implementation to use a three way split of the codepoint instead of binary search, renamed a few things, altered the way it encodes the general category information slightly, so that "Cn" (nonexistent character) turns into #x00 (was #x08), and fixed the case-conversion code (which ignored titlecase characters). Updated unidata.bin too with the new data.
Enable the optimization settings.
Import Paul's new routines for storing and accessing the Unicode data. i18n/NameAliases.txt: o New file: Unicode NameAliases tools/build-unidata.lisp: o New file: Reads UnicodeData.txt and NameAliases.txt and creates unidata.bin that is accessed by Lisp to obtain unicode information. code/unidata.lisp: o New file: Lisp interface to unidata.bin code/char.lisp: o Updated to use the new interface code/print.lisp: o Can't set up character-attributes array with full Unicode data at startup because the search-list isn't set up yet. Hence, only initialize part of the array, and use an *after-save-initializations* function to fill array with Unicode data after the search-list has been initialized. compiler/srctran.lisp: o Update deftransforms to use the new interface. tools/make-main-dist.sh: o Copy unidata.bin into the distribution. tools/worldbuild.lisp: o Load unidata.lisp tools/worldcom.lisp: o Compile unidata.lisp
file unidata.lisp was initially added on branch unicode-utf16-extfmt-branch.