i18n
John Chambers
jc at trillian.mit.edu
Mon Mar 20 19:41:16 EST 2006
Jerry Feldman wrote:
|
| The problem with Unix/Linux is that it is still based on 8-bit characters,
| and an internationalized program must be set up to use either 16-bit or
| wider. Java was written so that its native character type is 16 bits, which
| is sufficient for a majority of languages, but not for Asian languages.
But with UTF-8, this usually doesn't matter. Code that deals with
8-bit characters often works fine with UTF-8's multi-byte characters.
The reason is that most code only "understands" a few particular
bytes, and passes the rest through unchanged.
Thus, in my earlier music example, I mentioned a program I have
called abc2ps, which converts the ABC music notation to PostScript.
This isn't a trivial task, and the program is a significant chunk of
code. Examining the code shows that it parses its input one 8-bit
byte at a time, and doesn't do anything to deal with multi-byte
characters. But when I started testing it for problems with UTF-8
data, I couldn't find anything that didn't work. The UTF-8 multi-byte
characters were all passed through to the PS without any damage that
I could find.
In retrospect, it seems that the original unix kernel was designed so
that it has the same property. It treated most data as an unexamined
stream of bytes. It only examined the data in a very few cases,
primarily in the open() routine's parsing of file names. But there,
the only special characters were NULL and '/'. Every other bit
pattern was legal in a file name, because the characters' values
weren't examined at all. A true unix kernel will happily accept file
names that contain things like control characters. Only the 0x00 and
0x2F bytes are "seen" by the parser; the other 254 codes are all
legal in file names.
The abc2ps program that I'm using mostly only "sees" white space. The
part of a file that is music notation has a few other significant
characters, such as the letters A-G, and a-g, digits, and a few
others. But this part of the text is limited to just those
characters, and others are ignored. Embedded text (things like "Gm",
"Edim", "fine", "Coda", etc.) can be arbitrary text, but are set off
by quotes of various sorts, and again the string is passed to the PS
unexamined. Lines that start with an ASCII letter followed by a
colon are recognized, but the rest of such lines are mostly arbitrary
text. And so on. In general, this seems to work quite well. No byte
in a UTF-8 multi-byte character is recognized as significant, so they
either get ignored or passed on unchanged.
This doesn't work with UTF-16, of course. For starters, in ASCII text
you have a NULL for every other byte. Handling that does take special
routines that convert the input into an internal form that can't be
of type "char". But this was why Ken Thompson invented UTF-8. He seems to
have almost (but not quite) succeeded in figuring out how to do
multi-byte characters without any need of rewriting 8-bit code. And
he did it, of course, by figuring out which 8-bit bytes had to be
avoided in the character-encoding scheme. The UTF-8 scheme is
somewhat sparse as a result.
--
_,
O John Chambers
<:#/> <jc at trillian.mit.edu>
+ <jc1742 at gmail.com>
/#\ in Waltham, Massachusetts, USA, Earth
| |