i18n
Seth Gordon
sethg at ropine.com
Fri Mar 17 10:38:17 EST 2006
Ed Hill wrote:
>>
>>The problem with Unix/Linux is that it is still based on 8-bit characters,
>>and an internationalized program must be set up to use either 16-bit or
>>wider. Java was written where its native character type is 16 bits, which
>>is sufficient for a majority of languages, but not for Asian languages.
>
> The above, as written, is simply not true. UTF-8 is a perfectly valid
> Unicode encoding and, for the characters that match the ASCII 0x00 to
> 0x7F, it uses the *identical* 8bits/character encoding and is therefore
> largely (read: as much as possible) backwards-compatible with older
> programs, text files, etc.
The standard Unix string-handling libraries don't know from UTF-8, so,
for example, they will assume that every character is one byte wide.
You could encode "avión.txt" in UTF-8 and use it as a file name, and a
terminal window configured to use UTF-8 would be able to display that
name. But in order for "ls avi?n.txt" to work, the shell's globbing
algorithm would have to recognize that "\xc3\xb3" is the single UTF-8
character "ó" (and not, say, the two ISO-8859-1 characters "Ã³").
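
To make the byte-versus-character point concrete, here is a rough sketch
in Python. It is only an illustration: Python's fnmatch module stands in
for a shell's globbing code (it is not what any real shell uses), and the
variable names are made up for this example. Matching on raw bytes treats
"?" as one byte; matching on decoded text treats it as one character.

```python
import fnmatch

# "avión.txt", written with an escape so the source file's own
# encoding does not matter.
name = "avi\u00f3n.txt"
raw = name.encode("utf-8")            # the bytes stored as the file name

as_utf8 = raw.decode("utf-8")         # ó is one character
as_latin1 = raw.decode("iso-8859-1")  # same bytes read as two characters

# Byte-oriented matching: "?" consumes exactly one byte, so it cannot
# cover the two-byte sequence \xc3\xb3.
byte_match = fnmatch.fnmatchcase(raw, b"avi?n.txt")

# Character-oriented matching on the decoded name: "?" consumes ó.
char_match = fnmatch.fnmatchcase(as_utf8, "avi?n.txt")

# A byte-oriented matcher needs two "?" to get past the ó.
two_byte_match = fnmatch.fnmatchcase(raw, b"avi??n.txt")

print(len(raw), len(as_utf8), len(as_latin1))   # 10 9 10
print(byte_match, char_match, two_byte_match)   # False True True
```

The same ten bytes are nine characters under UTF-8 and ten under
ISO-8859-1, so the matcher has to be told which encoding applies before
"?" can mean "one character".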