Log of /src/tools/build-unidata.lisp


Revision 1.9
Mon Jun 27 15:11:30 2011 UTC (2 years, 9 months ago) by rtoy
Branch: MAIN
CVS Tags: GIT-CONVERSION, HEAD, snapshot-2011-07, snapshot-2011-09
Changes since 1.8: +4 -4 lines
Update to Unicode 6.0.0.


code/unidata.lisp:
o Update unicode version to 6.0.0
o Add pointer to build-unidata.lisp.
tools/build-unidata.lisp:
o Update unicode version to 6.0.0
o Print out directory path so we can see where we're getting the data
  from.


i18n/CaseFolding.txt
i18n/CompositionExclusions.txt
i18n/DerivedNormalizationProps.txt
i18n/NameAliases.txt
i18n/NormalizationCorrections.txt
i18n/SpecialCasing.txt
i18n/UnicodeData.txt
i18n/WordBreakProperty.txt
i18n/tests/NormalizationTest.txt
i18n/tests/WordBreakTest.txt:
o Update with new files from unicode.org.

Revision 1.4.14.1
Sun Sep 19 03:31:40 2010 UTC (3 years, 6 months ago) by rtoy
Branch: RELEASE-20B-BRANCH
CVS Tags: RELEASE_20b
Changes since 1.4: +2 -2 lines
Merge fix for long standing bug where the trie for Unicode 1.0 names
was given the wrong split value.

Revision 1.8
Sun Sep 19 03:31:12 2010 UTC (3 years, 6 months ago) by rtoy
Branch: MAIN
CVS Tags: cross-sol-x86-2010-12-20, cross-sol-x86-base, cross-sol-x86-merged, cross-sparc-branch-base, snapshot-2010-11, snapshot-2010-12, snapshot-2011-01, snapshot-2011-02, snapshot-2011-03, snapshot-2011-04, snapshot-2011-06
Branch point for: cross-sol-x86-branch, cross-sparc-branch
Changes since 1.7: +2 -2 lines
Oops.  Fix long standing bug where the trie for Unicode 1.0 names was
given the wrong split value.

Revision 1.7
Sat Sep 18 20:58:28 2010 UTC (3 years, 6 months ago) by rtoy
Branch: MAIN
Changes since 1.6: +19 -31 lines
Simple refactoring:  Add function to write out a dictionary and use it
to write out the unicode name dictionaries.

Revision 1.6
Sat Sep 18 20:47:51 2010 UTC (3 years, 6 months ago) by rtoy
Branch: MAIN
Changes since 1.5: +23 -18 lines
code/unidata.lisp:
o Just add some comments on why we don't put the dictionaries in
  unidata.bin.
o Print out some messages when building the hangul and cjk
  dictionaries so the user knows what's happening.

tools/build-unidata.lisp:
o Add some comments on the various parts of unidata.bin.

Revision 1.5
Wed Sep 15 21:06:39 2010 UTC (3 years, 7 months ago) by rtoy
Branch: MAIN
Changes since 1.4: +2 -2 lines
Add support for Unicode 5.2.  The normalization and wordbreak tests pass.

code/string.lisp:
o In %compose, handle the case where the composite character is
  outside the BMP and thus needs special handling for our UTF-16
  strings.

code/unidata.lisp:
o CJK Ideograph range has changed in 5.2.
o Fix bug in build-composition-table.  We were not correctly handling
  the case where the decomposition of a codepoint was outside the
  BMP.  Special care is needed to handle the UTF-16 strings that we
  use.
o The keys for the pairwise composition table are the full codepoints,
  so we need to shift one by 21 bits instead of 16.
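
The 21-bit shift in the last item can be illustrated with a small Python
sketch. This is not the CMUCL code: the names pair_key, split_key and
to_utf16 are hypothetical, and which codepoint of the pair gets shifted is
an assumption. The only details taken from the log are the 21-bit shift
and the need to treat codepoints outside the BMP specially in UTF-16.

```python
# Hypothetical illustration; only the 21-bit shift comes from the log entry.
CODEPOINT_BITS = 21  # code points go up to 0x10FFFF, which needs 21 bits


def pair_key(first, second):
    """Pack two full code points into one integer key (assumed ordering)."""
    return (first << CODEPOINT_BITS) | second


def split_key(key):
    """Recover the two code points from a key built by pair_key."""
    return key >> CODEPOINT_BITS, key & ((1 << CODEPOINT_BITS) - 1)


def pair_key_16(first, second):
    """The too-narrow variant: a 16-bit shift only works inside the BMP."""
    return (first << 16) | second


def to_utf16(cp):
    """Encode a code point as one or two UTF-16 code units."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]
```

With a 16-bit shift, a second codepoint at or above #x10000 spills into
the bits of the first, so distinct pairs can map to the same key. With 21
bits the two fields never overlap.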

tools/build-unidata.lisp:
o Update minor version to 2.

i18n/BidiMirroring.txt
i18n/CaseFolding.txt
i18n/CompositionExclusions.txt
i18n/DerivedNormalizationProps.txt
i18n/NameAliases.txt
i18n/NormalizationCorrections.txt
i18n/SpecialCasing.txt
i18n/UnicodeData.txt
i18n/WordBreakProperty.txt
i18n/tests/NormalizationTest.txt
i18n/tests/WordBreakTest.txt:
o Updated from Unicode 5.2.

i18n/unidata.bin
o Regenerated from new Unicode 5.2 files.

Revision 1.4
Fri Sep 11 16:22:35 2009 UTC (4 years, 7 months ago) by rtoy
Branch: MAIN
CVS Tags: amd64-dd-start, intl-2-branch-base, intl-branch-2010-03-18-1300, intl-branch-base, intl-branch-working-2010-02-11-1000, intl-branch-working-2010-02-19-1000, post-merge-intl-branch, pre-merge-intl-branch, release-20b-pre1, release-20b-pre2, snapshot-2009-11, snapshot-2009-12, snapshot-2010-01, snapshot-2010-02, snapshot-2010-03, snapshot-2010-04, snapshot-2010-05, snapshot-2010-06, snapshot-2010-07, snapshot-2010-08, sparc-tramp-assem-2010-07-19, sparc-tramp-assem-base, unicode-string-buffer-base, unicode-string-buffer-impl-base
Branch point for: RELEASE-20B-BRANCH, amd64-dd-branch, intl-2-branch, intl-branch, sparc-tramp-assem-branch, unicode-string-buffer-branch, unicode-string-buffer-impl-branch
Changes since 1.3: +164 -113 lines
tools/build-unidata.lisp:
o Add support for word break properties.
o Some cleanup of the code including moving the common code in
  write-ntrie* to write-ntrie.

code/unidata.lisp:
o Add support for word break properties.
o UNICODE-WORD-BREAK-CODE and UNICODE-WORD-BREAK return the property
  code and the property keyword for a codepoint, respectively.

i18n/WordBreakProperty.txt:
o New file for the word break properties.

Revision 1.3
Thu Jul 2 21:00:48 2009 UTC (4 years, 9 months ago) by rtoy
Branch: MAIN
CVS Tags: RELEASE_20a, release-20a-base, release-20a-pre1, snapshot-2009-08
Branch point for: RELEASE-20A-BRANCH
Changes since 1.2: +4 -4 lines
boot-2009-07.lisp:
o Bootstrap file needed to compile this change (because the current
  shrink-vector derive-type optimizer didn't handle union types).

compiler/fndb.lisp:
o Make the compiler warn if the result of lisp::shrink-vector is not
  used.  This is a problem because the compiler doesn't know that
  shrink-vector destructively modifies the length of a vector.  As a
  partial solution, warn the user if the result of shrink-vector is
  not used.

code/hash-new.lisp:
code/seq.lisp:
o Make sure the result of shrink-vector is used, to get rid of a new
  compiler warning.

code/unidata.lisp:
o Modify %unicode-full-case so that it doesn't use shrink-vector
  anymore.

compiler/seqtran.lisp:
o Fix shrink-vector derive-type optimizer to handle union types.

tools/build-unidata.lisp:
o Fix typo that somehow got in.
o Make sure the result of shrink-vector is used, to get rid of a new
  compiler warning.

Revision 1.2
Thu Jun 11 16:04:02 2009 UTC (4 years, 10 months ago) by rtoy
Branch: MAIN
CVS Tags: merged-unicode-utf16-extfmt-2009-06-11, portable-clx-base, portable-clx-import-2009-06-16, snapshot-2009-07
Branch point for: portable-clx-branch
Changes since 1.1: +1083 -0 lines
Merge Unicode work to trunk.  From label
unicode-utf16-extfmt-2009-06-11.

Revision 1.1.2.13
Wed Jun 10 00:28:22 2009 UTC (4 years, 10 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
CVS Tags: unicode-utf16-extfmt-2009-06-11
Changes since 1.1.2.12: +58 -109 lines
Refactor WRITE-UNIDATA by moving the common code that writes ntries
into its own routines.

Revision 1.1.2.12
Tue Jun 9 13:07:50 2009 UTC (4 years, 10 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.11: +159 -57 lines
Add CaseFolding.txt to unidata.bin so we can do case-insensitive
comparisons according to Unicode.

i18n/CaseFolding.txt:
o New file

code/unidata.lisp:
o Add new slots to the unidata structure to hold the simple and full
  case-folding information.
o Add UNICODE-CASE-FOLD-SIMPLE and UNICODE-CASE-FOLD-FULL functions to
  return the case-folded codepoint or string for the simple and full
  options, respectively.

tools/build-unidata.lisp:
o Add new slots to the unidata structure and the ucdent structure to
  hold the case folding information from CaseFolding.txt.
o Update routines to read the case folding data and to write the data
  to unidata.bin.
o Speed optimization: Use a hash table whose key is the codepoint and
  whose value is the index into the vector.  This preserves the
  structure of the code but vastly improves the speed of reading and
  processing the unicode data files, especially for the derived
  normalization properties.  (We should just replace the vector with
  the hash table.)
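
The speed optimization in the last item can be sketched in Python as
follows. This is an illustrative sketch only, not the CMUCL code: the
entry layout and the names build_index and find_entry are hypothetical.
It keeps the existing vector of entries and adds a table from codepoint
to vector index, so a lookup no longer has to search the vector.

```python
# Hypothetical sketch: entries stay in a vector, a table maps code -> index.
def build_index(entries):
    """Map each entry's code point to its position in the entries vector."""
    return {entry["code"]: i for i, entry in enumerate(entries)}


def find_entry(entries, index, code):
    """Return the entry for CODE, or None if the code point has no entry."""
    i = index.get(code)
    return None if i is None else entries[i]


entries = [
    {"code": 0x41, "name": "LATIN CAPITAL LETTER A"},
    {"code": 0xDF, "name": "LATIN SMALL LETTER SHARP S"},
    {"code": 0x1D15E, "name": "MUSICAL SYMBOL HALF NOTE"},
]
index = build_index(entries)
```

Code that walks the vector in order keeps working unchanged, which is the
sense in which the structure of the code is preserved.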

Revision 1.1.2.11
Fri Jun 5 16:22:09 2009 UTC (4 years, 10 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.10: +111 -21 lines
tools/build-unidata.lisp:
o Add support for reading SpecialCasing.txt to support full-casing
  operation.  (Currently does not support language-specific cases or
  context dependent cases.)
o Update some prints
o Add check to write-unidata to produce an error if we try to write
  more objects than we have allocated space for in the index table.

code/unidata.lisp:
o Support loading the full case tables
o Add functions to produce the full case string for a codepoint.

Revision 1.1.2.10
Fri May 29 16:12:40 2009 UTC (4 years, 10 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
CVS Tags: unicode-snapshot-2009-06
Changes since 1.1.2.9: +70 -43 lines
tools/build-unidata.lisp:
o Read composition exclusions from the composition exclusions file
  and save them in unidata.bin.

code/unidata.lisp:
o Read composition exclusions from unidata.bin
o Use the exclusions from unidata.bin instead of using the
  hand-initialized list.

i18n/unidata.bin:
o Updated with composition exclusions list.

Revision 1.1.2.9
Mon May 25 20:08:29 2009 UTC (4 years, 10 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.8: +105 -2 lines
Add support for quick check normalization properties.  (From Paul.)

i18n/DerivedNormalizationProps.txt:
o New file containing the normalization data we need.

tools/build-unidata.lisp:
o Read the normalization properties and build unidata.bin to include
  four new tries, one each for NFC/NFKC/NFD/NFKD.
o Add new 1 and 2 bit tries.

code/unidata.lisp:
o Read the new data
o Add new functions to return the quick check normalization data.

code/stream-vector-io.lisp:
code/stream.lisp:
o Add support for 1, 2, and 4 bit vectors for stream I/O.

Revision 1.1.2.8
Mon May 11 16:44:01 2009 UTC (4 years, 11 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.7: +31 -13 lines
o Add constants for the magic number and the Unicode major, minor, and
  upgrade version to make the code slightly easier to read.
o Add optional arg to BUILD-UNIDATA to allow user to specify where the
  Unicode files are.  (Requires updating READ-DATA and FOREACH-UCD.)
  This is useful when the original default directory doesn't work.

Revision 1.1.2.7
Thu May 7 03:24:02 2009 UTC (4 years, 11 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.6: +2 -2 lines
Document the magic number for the unidata.bin file.

Revision 1.1.2.6
Fri May 1 11:41:02 2009 UTC (4 years, 11 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
CVS Tags: unicode-snapshot-2009-05
Changes since 1.1.2.5: +27 -22 lines
Updates from Paul:

build-unidata.lisp:
o Fix bug in PACK-DECOMP which was not computing surrogate pairs
  correctly.
o Fix up and add some comments.
o Move the NameAliases, NormalizationCorrections, and BidiMirroring to
  READ-DATA.

unidata.bin:
o Updated to reflect corrections in PACK-DECOMP

Revision 1.1.2.5
Sun Apr 19 04:15:27 2009 UTC (4 years, 11 months ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.4: +3 -3 lines
More updates from Paul:

	changes the order [of the unicode categories], which fixes
	some bugs, too.  Need to rebuild unidata.bin once more.

Revision 1.1.2.4
Wed Apr 15 21:19:06 2009 UTC (5 years ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.3: +107 -9 lines
Another update from Paul:

	added combining class, bidi info, and Unicode 1.0 names -
	that's everything from the base UnicodeData.txt (and a few
	additions).

New files: BidiMirroring.txt and NormalizationCorrections.txt

Updated unidata.bin too.

Revision 1.1.2.3
Wed Apr 15 14:41:56 2009 UTC (5 years ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.2: +80 -33 lines
Updates from Paul:

	add numeric values and decompositions to unidata, added
	char-titlecase, and made string-capitalize use title-case
	rather than upper-case, when those are different.

The unidata.bin file needs to be rebuilt, and a cross-compile needs to
be done to support the new unidata.bin format.

Revision 1.1.2.2
Tue Apr 14 20:55:12 2009 UTC (5 years ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1.2.1: +275 -144 lines
New implementation of the unidata structures from Paul.  He says he

    changed the implementation to use a three way split of the
    codepoint instead of binary search, renamed a few things, altered
    the way it encodes the general category information slightly, so
    that "Cn" (nonexistent character) turns into #x00 (was #x08), and
    fixed the case-conversion code (which ignored titlecase
    characters).
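
A three-way split of the codepoint, as opposed to a binary search over
sorted ranges, might look like the following Python sketch. The field
widths (5, 8 and 8 bits), the class name SplitTrie and the use of nested
dictionaries are all assumptions made for illustration. The log does not
give the actual split values or the on-disk layout used by unidata.bin.

```python
# Hypothetical three-level lookup; the 5/8/8 bit split is an assumption.
MID_BITS = 8
LOW_BITS = 8


def split(cp):
    """Split a code point into (high, mid, low) bit fields."""
    low = cp & ((1 << LOW_BITS) - 1)
    mid = (cp >> LOW_BITS) & ((1 << MID_BITS) - 1)
    high = cp >> (LOW_BITS + MID_BITS)
    return high, mid, low


class SplitTrie:
    """Three fixed levels of indexing, one per field of the code point."""

    def __init__(self, default=0):
        self.default = default  # value for code points never stored
        self.top = {}

    def set(self, cp, value):
        high, mid, low = split(cp)
        self.top.setdefault(high, {}).setdefault(mid, {})[low] = value

    def get(self, cp):
        high, mid, low = split(cp)
        return self.top.get(high, {}).get(mid, {}).get(low, self.default)


trie = SplitTrie(default=0x00)  # e.g. #x00 standing in for category "Cn"
trie.set(0x41, 0x09)            # arbitrary example value
```

Each lookup costs three index steps regardless of how many codepoints are
stored, and unassigned codepoints fall through to the default.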

Updated unidata.bin too with the new data.

Revision 1.1.2.1
Sat Apr 11 12:04:27 2009 UTC (5 years ago) by rtoy
Branch: unicode-utf16-extfmt-branch
Changes since 1.1: +513 -0 lines
Import Paul's new routines for storing and accessing the Unicode
data.

i18n/NameAliases.txt:
o New file:  Unicode NameAliases

tools/build-unidata.lisp:
o New file: Reads UnicodeData.txt and NameAliases.txt and creates
  unidata.bin that is accessed by Lisp to obtain unicode information.

code/unidata.lisp:
o New file:  Lisp interface to unidata.bin

code/char.lisp:
o Updated to use the new interface

code/print.lisp:
o Can't set up character-attributes array with full Unicode data at
  startup because the search-list isn't set up yet.  Hence, only
  initialize part of the array, and use an
  *after-save-initializations* function to fill array with Unicode
  data after the search-list has been initialized.

compiler/srctran.lisp:
o Update deftransforms to use the new interface.

tools/make-main-dist.sh:
o Copy unidata.bin into the distribution.

tools/worldbuild.lisp:
o Load unidata.lisp

tools/worldcom.lisp:
o Compile unidata.lisp

Revision 1.1
Sat Apr 11 12:04:27 2009 UTC (5 years ago) by rtoy
Branch: MAIN
Branch point for: unicode-utf16-extfmt-branch
FILE REMOVED
file build-unidata.lisp was initially added on branch unicode-utf16-extfmt-branch.