From: Matthew Flatt <mflatt@cs.utah.edu>
To: plt-scheme@po.cs.brown.edu
Subject: [plt-scheme] Unicode, take 3
Date: Thu, 1 Apr 2004 06:50:15 -0700

As it turns out, getting Unicode right has been even more difficult
than I expected. Among the consistent sources of trouble:

  - The mismatch between "character" and "code unit".

  - Dealing with character case (lowercase, uppercase, titlecase) and
    locale-sensitive case folding (the famous Turkish "i" problem).

  - "Fixed-width" fonts generally don’t exist outside of Latin-1. This
    is a problem, for example, in DrScheme.

  - Shifting text directions (left-to-right for Roman characters,
    right-to-left for Arabic, top-to-bottom for Chinese) has not been
    addressed at all for the editor.
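
To make the first two items concrete, here is a small sketch in Python
(chosen only for illustration; none of this is MzScheme code, and the
sample characters are arbitrary picks). It shows one character needing
two UTF-16 code units, a case mapping that changes string length, and a
locale-blind lowercase that cannot produce the Turkish dotless "ı".

```python
# Illustrative only: sample characters are arbitrary, not from MzScheme.

def utf16_units(s):
    """Count the UTF-16 code units needed to encode the string s."""
    return len(s.encode("utf-16-le")) // 2

clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
print(len(clef), utf16_units(clef))  # code points vs. UTF-16 code units

# Case mappings are not always one-to-one: uppercasing the German
# sharp s produces two characters.
sharp_s_upper = "\u00DF".upper()
print(sharp_s_upper)

# Locale-blind lowercasing maps "I" to "i"; Turkish text expects the
# dotless U+0131 instead, which this call has no way of knowing.
plain_lower = "I".lower()
print(plain_lower == "i", plain_lower == "\u0131")
```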

I think I’ve hit on an approach that eliminates these issues and
others, but is still convenient for many people.

Starting with version 299.4, MzScheme will support only the subset of
Unicode that represents Chinese characters (both simplified and
traditional). The advantages are obvious:

  - Each character has a single code point, with no question of
    combining accents, etc.

  - There’s no character case at all, so case-folding is moot.

  - All fonts are fixed width. Furthermore, the width of a character is
    the same as its height and determined by the font size, so
    text-measuring methods are unnecessary.

  - Text always runs top-to-bottom. Besides the uniformity this gives
    the editor, it’s particularly useful for Scheme, where nested
    structure often pushes code too far to the right (at least for
    those who insist on 79-column code). A top-to-bottom character
    layout eliminates this nesting effect.

For communicating with the rest of the world, MzScheme will still use
UTF-8 for encoding. (Popular Chinese-specific encodings, such as Big5
and GB, tend to work with only simplified or traditional characters, and
we have no reason to pick one or the other.) Note that every Chinese
character is encoded as three bytes in UTF-8, effectively making it
a fixed-width encoding for our purposes.
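
The three-byte claim is easy to check. The sketch below uses Python
purely for illustration, with a handful of arbitrarily chosen sample
characters (an assumption, not a list from MzScheme). It also shows the
caveat: the claim holds for characters in the 16-bit range, while a
character from CJK Extension B, such as U+20000, takes four bytes.

```python
# Sample characters are illustrative picks, not from any MzScheme source.
samples = ["中", "文", "語", "汉"]

def utf8_len(ch):
    """Number of bytes in the UTF-8 encoding of a single character."""
    return len(ch.encode("utf-8"))

# Map each sample character to its UTF-8 byte count.
lengths = {ch: utf8_len(ch) for ch in samples}
print(lengths)

# A CJK Extension B character lies outside the 16-bit range and so
# needs a longer UTF-8 sequence.
extension_b = "\U00020000"
print(utf8_len(extension_b))
```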

The shift to Chinese may also improve DrScheme for pedagogic settings.
For example, we will no longer have to worry about students misspelling
"lambda" as "lamdba" (a classic problem that often confounds teaching
assistants).

Matthew