i18n

John Chambers jc at trillian.mit.edu
Mon Mar 20 19:41:16 EST 2006


Jerry Feldman wrote:
|
| The problem with Unix/Linux is that it is still based on 8-bit characters,
| and an internationalized program must be set up to use 16-bit or wider
| characters. Java was written so that its native character type is 16 bits,
| which is sufficient for a majority of languages, but not for Asian languages.

But with UTF-8, this usually doesn't matter. Code that deals with
8-bit characters often works fine with UTF-8's multi-byte characters.
The reason is that most code only "understands" a few particular
bytes, and passes the rest through unchanged.
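
To make this concrete, here's a rough sketch in C of the kind of
byte-at-a-time code I mean (a made-up filter, not anything from a
real program): it treats only '#' and newline as special, stripping
"comments", and every other byte value passes through untouched.

    /* Hypothetical sketch: a byte-at-a-time filter that only
     * "understands" two bytes, '#' (start of comment) and '\n'
     * (end of line).  The 0x80-0xF4 bytes that make up UTF-8
     * multi-byte sequences are never examined, so they pass
     * through intact. */
    #include <stdio.h>

    int main(void)
    {
        int c, in_comment = 0;

        while ((c = getchar()) != EOF) {
            if (c == '#')            /* special byte: comment starts */
                in_comment = 1;
            else if (c == '\n')      /* special byte: comment ends */
                in_comment = 0;
            if (!in_comment || c == '\n')
                putchar(c);          /* other bytes pass through */
        }
        return 0;
    }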

Thus, in my earlier music example, I mentioned a program I have
called abc2ps, which converts the ABC music notation to PostScript.
This isn't a trivial task, and the program is a significant chunk of
code. Examining the code shows that it parses its input one 8-bit
byte at a time, and doesn't do anything to deal with multi-byte
characters. But when I started testing it for problems with UTF-8
data, I couldn't find anything that didn't work. The UTF-8 multi-byte
characters were all passed through to the PS without any damage that
I could find.

In retrospect, it seems that the original unix kernel was designed
with the same property. It treated most data as an unexamined stream
of bytes. It examined the data in only a very few cases, primarily
in the open() routine's parsing of file names. And even there, the
only special characters were NUL and '/'. Every other bit pattern
was legal in a file name, because the characters' values weren't
examined at all. A true unix kernel will happily accept file names
that contain things like control characters. Only the 0x00 and 0x2F
bytes are "seen" by the parser; the other 254 codes are all legal in
file names.
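
A sketch of that style of parsing (not actual kernel code, just the
idea): only the byte values 0x00 and 0x2F are ever examined, so a
file name containing UTF-8 sequences or control characters comes
through intact.

    /* Sketch of path parsing in the spirit described above: NUL
     * terminates the name, '/' separates components, and all 254
     * other byte values are opaque, legal component bytes. */
    #include <stdio.h>

    static void list_components(const char *path)
    {
        const char *p = path;

        while (*p) {                    /* 0x00 ends the name */
            const char *start = p;
            while (*p && *p != '/')     /* 0x2F splits components */
                p++;
            if (p > start)
                printf("component: %.*s\n", (int)(p - start), start);
            if (*p == '/')
                p++;                    /* skip the separator */
        }
    }

    int main(void)
    {
        /* UTF-8 "etude" with an accented initial letter */
        list_components("/usr/local/\xC3\xA9tude");
        return 0;
    }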

The abc2ps program that I'm using mostly only "sees" white space. The
part of a file that is music notation has a few other significant
characters, such as the letters A-G and a-g, digits, and a few
others. But that part of the text is limited to just those
characters, and others are ignored. Embedded text (things like "Gm",
"Edim", "fine", "Coda", etc.) can be arbitrary text, but it is set
off by quotes of various sorts, and again the string is passed to
the PS unexamined. Lines that start with an ASCII letter followed by
a colon are recognized, but the rest of such lines are mostly
arbitrary text. And so on. In general, this seems to work quite
well. No byte in a UTF-8 multi-byte character is recognized as
significant, so such bytes either get ignored or passed on unchanged.
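
Here's a hypothetical sketch of that sort of dispatch (the details
are invented, not abc2ps's actual code): a line starting with an
ASCII letter and a colon is treated as a header field, and the rest
of the line is passed along unexamined, so UTF-8 bytes in a title or
lyric survive untouched.

    /* Hypothetical line dispatcher: only the first two bytes of a
     * line are examined; everything after them is arbitrary text. */
    #include <ctype.h>
    #include <stdio.h>

    static void handle_line(const char *line)
    {
        if (isalpha((unsigned char)line[0]) && line[1] == ':')
            printf("header %c: %s\n", line[0], line + 2);  /* unexamined */
        else
            printf("music:    %s\n", line);
    }

    int main(void)
    {
        handle_line("T:\xC3\x89tude in C");  /* UTF-8 title, passed through */
        handle_line("GABc dedB");            /* notation: only ASCII matters */
        return 0;
    }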

This doesn't work with UTF-16, of course. For starters, in ASCII
text encoded as UTF-16, every other byte is a NUL. That does take
special routines that convert the input into an internal form that
can't be of type "char". But this was why Ken Thompson invented
UTF-8. He seems to have almost (but not quite) succeeded in figuring
out how to do multi-byte characters without any need to rewrite
8-bit code. And he did it, of course, by figuring out which 8-bit
byte values had to be avoided in the character-encoding scheme. The
UTF-8 scheme is somewhat sparse as a result.
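
The encoding rules themselves show why this works. Here's a sketch
of a (modern, 4-byte-maximum) UTF-8 encoder; the point to notice is
that every byte of a multi-byte sequence has its high bit set, so no
byte of one can ever be mistaken for NUL, '/', or any other ASCII
character that old 8-bit code treats as special.

    /* Sketch of the UTF-8 encoding rules (RFC 3629 form): lead
     * bytes are 0xC0-0xF7, continuation bytes 0x80-0xBF, so the
     * ASCII range 0x00-0x7F never appears inside a sequence. */
    #include <stdio.h>

    static int utf8_encode(unsigned long cp, unsigned char out[4])
    {
        if (cp < 0x80) {            /* ASCII: one byte, high bit clear */
            out[0] = (unsigned char)cp;
            return 1;
        } else if (cp < 0x800) {
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else if (cp < 0x10000) {
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        } else {
            out[0] = 0xF0 | (cp >> 18);
            out[1] = 0x80 | ((cp >> 12) & 0x3F);
            out[2] = 0x80 | ((cp >> 6) & 0x3F);
            out[3] = 0x80 | (cp & 0x3F);
            return 4;
        }
    }

    int main(void)
    {
        unsigned char buf[4];
        int i, n = utf8_encode(0x4E2DUL, buf);  /* a CJK character */

        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);            /* prints E4 B8 AD */
        printf("\n");
        return 0;
    }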


--
   _,
   O   John Chambers
 <:#/> <jc at trillian.mit.edu>
   +   <jc1742 at gmail.com>
  /#\  in Waltham, Massachusetts, USA, Earth
  | |


