
BLU Discuss list archive



i18n



Ed Hill wrote:
>>
>>The problem with Unix/Linux is that it is still based on 8-bit characters, 
>>and an internationalized program must be set up to use either 16-bit or 
>>wider. Java was written so that its native character type is 16 bits, which 
>>is sufficient for a majority of languages, but not for Asian languages.
> 
> The above, as written, is simply not true.  UTF-8 is a perfectly valid
> Unicode encoding and, for the characters that match the ASCII 0x00 to
> 0x7F, it uses the *identical* 8bits/character encoding and is therefore
> largely (read: as much as possible) backwards-compatible with older
> programs, text files, etc.

The standard Unix string-handling libraries don't know from UTF-8, so, 
for example, they will assume that every character is one byte wide.
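
To make the byte-versus-character distinction concrete, here is a minimal Python sketch. It only illustrates the counting mismatch; it is not the C library code itself, and the variable names are made up for the example.

```python
# "\u00f3" is the single character o-with-acute-accent (U+00F3).
text = "\u00f3"
encoded = text.encode("utf-8")

char_count = len(text)      # counts characters
byte_count = len(encoded)   # counts bytes, as a byte-oriented strlen() would

print(char_count, byte_count, encoded)
```

One character comes out as two bytes, so any routine that equates "one byte" with "one character" will miscount it.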

You could encode "avión.txt" in UTF-8 and use it as a file name, and a 
terminal window configured to use UTF-8 would be able to display that 
name.  But in order for "ls avi?n.txt" to work, the shell's globbing 
algorithm would have to recognize that "\xc3\xb3" is the single UTF-8 
character "ó" (and not, say, the two ISO-8859-1 characters "Ã³").
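
A rough sketch of that globbing problem, using Python's fnmatch module as a stand-in for a shell's matcher. This is an assumption made for illustration: no particular shell is claimed to behave this way, and matching on bytes here merely simulates a matcher that treats every byte as a character.

```python
import fnmatch

raw = b"\xc3\xb3"
as_utf8 = raw.decode("utf-8")         # read as UTF-8: one character
as_latin1 = raw.decode("iso-8859-1")  # read as ISO-8859-1: two characters

name = "avi\u00f3n.txt"   # the file name, with U+00F3 in the middle
pattern = "avi?n.txt"     # "?" should match exactly one character

# Character-aware matching: "?" consumes the whole accented letter.
char_match = fnmatch.fnmatchcase(name, pattern)

# Byte-oriented matching: "?" consumes only \xc3, leaving \xb3 unmatched.
byte_match = fnmatch.fnmatchcase(name.encode("utf-8"), pattern.encode("ascii"))

print(len(as_utf8), len(as_latin1), char_match, byte_match)
```

The character-aware match succeeds, while the byte-oriented one fails because "?" swallows only the first of the two bytes.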





BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org