i18n

Fri Mar 17 10:38:17 EST 2006

Ed Hill wrote:
>>
>>The problem with Unix/Linux is that it is still based on 8-bit characters, 
>>and an internationalized program must be set up to use either 16-bit or 
>>wider. Java was written so that its native character type is 16 bits, which 
>>is sufficient for a majority of languages, but not for Asian languages.
> 
> The above, as written, is simply not true.  UTF-8 is a perfectly valid
> Unicode encoding and, for the characters that match the ASCII 0x00 to
> 0x7F, it uses the *identical* 8bits/character encoding and is therefore
> largely (read: as much as possible) backwards-compatible with older
> programs, text files, etc.

The standard Unix string-handling libraries don't know from UTF-8, so, 
for example, they will assume that every character is one byte wide.
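
To make the byte-versus-character point concrete, here is a small sketch
in Python (my choice for illustration only; strlen() in C likewise counts
bytes, not characters). The sample name is an arbitrary one containing a
single non-ASCII letter.

```python
# A name with one non-ASCII character (U+00F3, "o" with acute accent).
name = "avi\u00f3n.txt"

# UTF-8 encodes U+00F3 as two bytes, 0xC3 0xB3; every other character
# in this name is ASCII and takes one byte.
encoded = name.encode("utf-8")

char_count = len(name)      # counts characters (code points)
byte_count = len(encoded)   # counts bytes, which is what a byte-oriented routine sees

print(char_count, byte_count)
```

Any routine that equates "one byte" with "one character" will be off by
one for this name.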

You could encode "avión.txt" in UTF-8 and use it as a file name, and a 
terminal window configured to use UTF-8 would be able to display that 
name.  But in order for "ls avi?n.txt" to work, the shell's globbing 
algorithm would have to recognize that "\xc3\xb3" is the single UTF-8 
character "ó" (and not, say, the two ISO-8859-1 characters "Ã³").
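
A rough sketch of that difference, using Python's fnmatch module as a
stand-in for the shell's globbing code. That substitution is an
assumption for illustration; a real shell does its own matching and
consults the locale. With bytes arguments, fnmatch matches byte by byte,
which approximates a globber that is unaware of UTF-8.

```python
from fnmatch import fnmatchcase

# The UTF-8 bytes of the file name, as they would be stored on disk.
raw = "avi\u00f3n.txt".encode("utf-8")   # b'avi\xc3\xb3n.txt'

# Byte-wise matching: '?' consumes exactly one byte, so it cannot
# cover the two-byte sequence \xc3\xb3 and the match fails.
bytewise = fnmatchcase(raw, b"avi?n.txt")

# Character-wise matching: decode as UTF-8 first, so '?' covers the
# single character U+00F3 and the match succeeds.
charwise = fnmatchcase(raw.decode("utf-8"), "avi?n.txt")

# The same bytes interpreted as ISO-8859-1 are two separate characters
# (U+00C3 and U+00B3), so only a two-'?' pattern matches that reading.
latin1 = raw.decode("iso-8859-1")
latin1_two = fnmatchcase(latin1, "avi??n.txt")

print(bytewise, charwise, latin1_two)
```

The bytes are identical in all three cases; only the assumed encoding
changes what "?" is allowed to match.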