Jerry Feldman wrote:
| The problem with Unix/Linux is that it is still based on 8-bit characters,
| and an internationalized program must be set up to use either 16-bit or
| wider. Java was written where its native character type is 16 bits, which
| is sufficient for a majority of languages, but not for Asian languages.

But with UTF-8, this usually doesn't matter. Code that deals with 8-bit characters often works fine with UTF-8's multi-byte characters. The reason is that most code only "understands" a few particular bytes, and passes the rest through unchanged. (A few short C sketches at the end of this message make these points concrete.)

Thus, in my earlier music example, I mentioned a program I have called abc2ps, which converts the ABC music notation to PostScript. This isn't a trivial task, and the program is a significant chunk of code. Examining the code shows that it parses its input one 8-bit byte at a time, and does nothing to deal with multi-byte characters. But when I started testing it for problems with UTF-8 data, I couldn't find anything that didn't work. The UTF-8 multi-byte characters were all passed through to the PS without any damage that I could find.

In retrospect, it seems that the original unix kernel was designed with the same property. It treated most data as an unexamined stream of bytes. It examined the data in only a very few cases, primarily in the open() routine's parsing of file names. But there, the only special characters were NUL and '/'. Every other bit pattern was legal in a file name, because the characters' values weren't examined at all. A true unix kernel will happily accept file names that contain things like control characters. Only the 0x00 and 0x2F bytes are "seen" by the parser; the other 254 codes are all legal in file names.

The abc2ps program that I'm using mostly "sees" only white space. The part of a file that is music notation has a few other significant characters, such as the letters A-G and a-g, digits, and a few others. But this part of the text is limited to just those characters, and others are ignored. Embedded text (things like "Gm", "Edim", "fine", "Coda", etc.) can be arbitrary text, but is set off by quotes of various sorts, and again the string is passed to the PS unexamined. Lines that start with an ASCII letter followed by a colon are recognized, but the rest of such lines are mostly arbitrary text. And so on. In general, this seems to work quite well. No byte in a UTF-8 multi-byte character is recognized as significant, so they either get ignored or passed on unchanged.

This doesn't work with UTF-16, of course. For starters, in ASCII text you have a NUL for every other byte. That does take special routines that convert the input into an internal form that can't be type "char".

But this was why Ken Thompson invented UTF-8. He seems to have almost (but not quite) succeeded in figuring out how to do multi-byte characters without any need to rewrite 8-bit code. And he did it, of course, by figuring out which 8-bit bytes had to be avoided in the character-encoding scheme. The UTF-8 scheme is somewhat sparse as a result.

--
   _,
  O   John Chambers
 <:#/> <jc at trillian.mit.edu>
   +   <jc1742 at gmail.com>
  /#\  in Waltham, Massachusetts, USA, Earth
  | |
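Sketch 1: the pass-through property. This is my own illustration, not code from abc2ps: a filter that "understands" only the bytes a-g (uppercasing them, the way a parser might normalize note letters) and copies every other byte through untouched. Every byte of a UTF-8 multi-byte sequence is in the range 0x80-0xFF, so none of them ever matches the test, and the sequence reaches the output intact.

    #include <stdio.h>

    int main(void)
    {
        int c;
        while ((c = getchar()) != EOF) {
            if (c >= 'a' && c <= 'g')   /* the only bytes this filter "understands" */
                putchar(c - 'a' + 'A');
            else
                putchar(c);             /* everything else passes through unchanged,
                                           including the 0x80-0xFF bytes that make
                                           up UTF-8 multi-byte sequences */
        }
        return 0;
    }

Feed it "café" and the 0xC3 0xA9 bytes of the é come out exactly as they went in.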
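Sketch 2: the file-name point, shown the same way. This is a sketch of the parsing idea, not actual kernel code: the loop compares each byte against exactly two values, '/' and the NUL terminator, so every other byte (control characters and UTF-8 continuation bytes included) lands in a component unexamined.

    #include <stdio.h>

    /* Split a path into components, treating only '/' and NUL as special. */
    static void components(const char *path)
    {
        const char *start = path;
        for (;; path++) {
            if (*path == '/' || *path == '\0') {    /* the only two "seen" bytes */
                if (path > start)
                    printf("component: %.*s\n", (int)(path - start), start);
                if (*path == '\0')
                    break;
                start = path + 1;
            }
        }
    }

    int main(void)
    {
        components("/usr/caf\xc3\xa9/bin");  /* "café": 0xC3 0xA9 pass unexamined */
        return 0;
    }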
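Sketch 3: why UTF-16 breaks "char"-based code. Plain ASCII "AB" encodes in little-endian UTF-16 as 41 00 42 00, so any routine that treats a 0x00 byte as the string terminator gives up after the first byte:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AB" in little-endian UTF-16, plus a two-byte terminator */
        const char ab[] = { 0x41, 0x00, 0x42, 0x00, 0x00, 0x00 };
        printf("strlen() sees %zu byte(s)\n", strlen(ab));  /* prints 1 */
        return 0;
    }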
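Sketch 4: the avoided-bytes trick itself, visible in the encoding rules. This sketch covers only the 1- to 3-byte cases (real UTF-8 also has a 4-byte case). Lead bytes start at 0xC0 and continuation bytes are 0x80-0xBF, so no byte of a multi-byte sequence ever falls in the ASCII range 0x00-0x7F, which is where all the bytes that 8-bit code treats as significant live.

    #include <stdio.h>

    /* Encode a code point below 0x10000 as UTF-8 (4-byte case omitted). */
    static int utf8_encode(unsigned long cp, unsigned char *out)
    {
        if (cp < 0x80) {            /* 0xxxxxxx: plain ASCII, one byte    */
            out[0] = cp;
            return 1;
        } else if (cp < 0x800) {    /* 110xxxxx 10xxxxxx                  */
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        } else {                    /* 1110xxxx 10xxxxxx 10xxxxxx         */
            out[0] = 0xE0 | (cp >> 12);
            out[1] = 0x80 | ((cp >> 6) & 0x3F);
            out[2] = 0x80 | (cp & 0x3F);
            return 3;
        }
    }

    int main(void)
    {
        unsigned char buf[3];
        int i, n = utf8_encode(0x00E9, buf);    /* U+00E9, e-acute */
        for (i = 0; i < n; i++)
            printf("%02X ", buf[i]);            /* prints: C3 A9 */
        printf("\n");
        return 0;
    }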