{Long message; sorry, no online repository I can give a link to.}

(What's "i18n"? Answer: a quick way to type "internationalization" if you don't have auto-completion (like tab-completion) for routine text entry; most people don't. It's a 20-letter word.)

Basically, entering Japanese is one specific aspect of a much more general topic, grounded in the fact that people who don't use our really-simple* 26-letter alphabet truly *do* often want to read and type in their own language and writing system when using their computers. Can't blame them.

*Except for caps and small letters, that is; many alphabets and alphabet-like writing systems don't have caps.

I need to do a lot of study and "digging" for i18n in Linux; I want an OS (and e-mail composer) that can insert any Unicode character and render most major writing systems. Right now, Libranet (a nice Debian derivative) is very disappointing (ASCII only!) in its default configuration for e-mail composition in Opera. (It's not Opera's fault, I'm just about sure.) I'll be migrating to a newer machine RSN, and will really get serious about i18n when I do. So far (I haven't really tried), Libranet renders most writing systems just fine. (Yes, I do know about Yudit, although I tend to forget about it; shame.)*

*Li18nux, anyone? Only ~90 Google hits!

Opera's e-mail composer in Windows is not even a text editor (it's pathetic), but it has one endearing feature: enter the "U+nnnn" hex code for a Unicode character, do an Alt-X, and, if it's in your fonts, it will be rendered, in-line. Combine that with MS Windows' Alt-nnnn scheme for entering 8-bit characters in the current encoding, and I'm almost as happy as a clam in an unpolluted wetland.

Essentially, what follows is a bunch of thoughts about i18n, trying to stick to what's important; it could help introduce the topic. At the end, I have a reply to Robert. Corrections are welcome! I'm an amateur and dilettante about this topic.

===

Being the son of a father born in (Tsarist!) Russia and a first-generation American (English) mother, perhaps I'm a bit more of an internationalist (I didn't count the letters :) ) than many with native-born, say, great-grandparents or more.

Around a decade ago, when I tried Xdenu Linux (command-line only!) (and Debian, also command-line, a few years later), but was using an MS-DOS ("Moss Doss") 6.22 machine, I was active on the Early Music* mailing list, mirrored to rec.music.early (iirc!), and was horrified to see badly-munged European names; that was MS-DOS and Codepage 437. After researching matters, I learned about the Codepage 850 series, installed Codepage 850, and was further horrified to see different munging.

*As in Handel, say, not early Beatles (which I love, too).

A fellow known as Kosta Kostis, of Greek heritage, a German resident (and probably German citizen), hated seeing letters with umlauts (pairs of dots over a, o, and u) and the German ß (eszett) badly munged in DOS. He found out about Codepage 819, which is identical in its encoding and (implied? actual?) character set to Latin-1, a.k.a. ISO-8859-1. Because DOS commands have hidden sanity checks that reject all but a small subset of the probably several hundred codepages, he wrote his own replacement commands and packaged them along with a set of codepages corresponding to ISO-8859-[n] for [n] up to 10, iirc. I got his .zip file, installed it, fooled around a bit, and finally installed all 10 or so.
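To make that munging concrete, here's a minimal Python 3 sketch of my own (nothing Kosta shipped): the byte 0xE4 means "ä" in Codepage 819/Latin-1, but something else entirely under each DOS codepage.

    # The name "Händel" encoded as Codepage 819 / Latin-1 bytes:
    name = "Händel".encode("latin-1")   # b'H\xe4ndel'

    print(name.decode("latin-1"))   # Händel -- correct
    print(name.decode("cp437"))     # HΣndel -- 0xE4 is a Greek capital sigma there
    print(name.decode("cp850"))     # Hõndel -- different munging, as promised

    # And the Opera "U+nnnn" trick, in effect:
    print(chr(0x00E4))              # ä -- U+00E4, LATIN SMALL LETTER A WITH DIAERESIS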
I still have a nice HP Vectra 386-16/N DOS machine that's '8859-compatible; it can display Turkish, Icelandic, or Polish, but simply not all at once. The nice box-drawing characters do get lost, a tradeoff I was willing to live with.

Alan Flavell <http://ppewww.ph.gla.ac.uk/~flavell/> (home page) was the first person whose online text clued me in to i18n and Codepage 819. Others I remember were Jukka Korpela (really good), Markus Kuhn, and Roman Czyborra. I became a Unicode hobbyist (no kidding). Studying Unicode, and studying about it, reawakened an interest in writing systems. While Unicode is not a divine gift (it's not universally loved, apparently especially in Asia), it's the most widely accepted way, and quite a good one, of making a computer work with other writing systems. (Btw, "writing", in this context, refers to typing, typesetting, printing, font design, etc.; it's far more than just handwriting, which is a small subset. The Yahoo Group Qalam (reed/pen in Arabic) is a mailing list for those really interested in the topic.)

I politely harangued Opera Software a few years ago, and might possibly have had a little influence in making their browser much more internationally compatible, and sooner.

I18n is a huge topic, and here in the USA, for reasons of cultural history, geographical isolation*, and sheer size, we are mostly a monolingual nation with by far the simplest major writing system there is. Malaysian/Indonesian and a very few other less-widely-spoken languages are about the only others besides English that don't routinely add diacritical marks to some of their letters.

*But consider Canadian French near the Quebec border, and Spanish as our national second language, almost...

When rendering (to screen or paper), we can simply take each byte of text and place it to the right of the previous one, accounting for line breaks either in flowed format (word processors, and up-to-date e-mail) or explicitly. Life, for us, is simple. The various ISO character sets worked well for limited ranges of languages, but for a truly global setup, Unicode is the way to go (unless you're in parts of Asia, afaik :) ).

The text at the beginning of the Unicode manual* is a capsule introduction to i18n as it affects computer text, including text preparation, storage, rendering, and sorting (collating) sequences. It's not a general introduction to writing systems, although it can be a shock at times to read about their essentials.

*Online in PDF form (very nice, too), but not as one huge file, afaik! <www.unicode.org> should get you started.

In general, 16 bits per character works well, and the Unicode people do value common sense a lot. The first 128 code points coincide exactly with ASCII. Fun begins at the 8th bit, and ends only with extensions beyond 16 bits, which the Unicode people were sensible enough to provide for.

One surprise is that quite a few important writing systems (in India, in particular) don't, strictly speaking, have true alphabets, although a casual look at their Unicode charts would make you think so. (They are technically abugidas; abjads, like Arabic and Hebrew, are the related category where vowels are mostly left unwritten. I have had little luck committing those definitions to memory.) Each standalone character, just about always, has an inherent vowel, apparently just about always like our "ah". It's as if we had no vowels, and all our regular letters were called "ba, ca, da, fa, ga, ha, ja, ka, etc.". (The name "Jalal Talabani" would be quite concise in such a system!)
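Backing up a couple of paragraphs: a quick Python 3 sketch (my own illustration) of "fun begins at the 8th bit" and of the extension beyond 16 bits, which UTF-16 handles with what are called surrogate pairs.

    print(ord("A"))          # 65 -- within the first 128 (ASCII) code points
    print(hex(ord("ä")))     # 0xe4 -- fun begins at the 8th bit

    clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, a code point beyond 16 bits
    print(hex(ord(clef)))    # 0x1d11e

    # In UTF-16, such a character becomes a "surrogate pair" of two
    # 16-bit units -- the extension mechanism mentioned above:
    print(clef.encode("utf-16-be").hex())   # d834dd1e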
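And to make that inherent-vowel business concrete, a tiny sketch of mine using Devanagari and Python's built-in Unicode database (the "vowel killer" I mention just below is formally called a virama):

    import unicodedata

    ka = "\u0915"        # क -- by itself, this is the syllable "ka"
    virama = "\u094D"    # the "vowel killer"

    print(unicodedata.name(ka))       # DEVANAGARI LETTER KA
    print(unicodedata.name(virama))   # DEVANAGARI SIGN VIRAMA
    print(ka + virama)                # क् -- the bare consonant "k", vowel killed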
There are ways, as just shown, to write standalone consonants, using what are called (slang) "vowel killers", and also ways to write standalone vowels, of course.

Then, one also learns that one cannot simply take the next byte, look up its bitmap, and render it. Many writing systems require rather-complicated schemes for analyzing byte strings locally and creating acceptable bitmap images.

Arabic and Hebrew, of course, are written right-to-left (RtoL), and some real fun begins when you mix them with LtoR scripts (script: one word for "writing system"). Consider Arabic text, flowed format, with a fairly long English quote in-line. Line breaks when rendering? Major change in column width? Editing the quote, maybe extending it? Unicode handles this (and probably all other matters concerning mixed-direction text) with the BiDi algorithm. (That's "bidirectional", of course, not the small-cigar-like bidis of India.)

Arabic presents special challenges. Scientific American had an excellent article about its use in computers roughly 15 years (?) ago. Arabic *has* to be rendered essentially as connected script, much like our handwritten text where letters are joined. Arabic rendered with standalone letters looks quite bad (much worse, afaik, than all caps in e-mail), and is probably a bear to read. (I don't know Arabic, only about it, in general.) Many letters in Arabic have no fewer than four forms, a good number of which are not recognizable as the same letter to unschooled eyes. (Those four forms are initial, medial, final, and standalone. Initial starts a word, medial is in the middle, and final ends a word, just to clarify.) Arabic letters also need to be joined, and the joining seems to be non-trivial. The whole process of rendering byte strings of text includes what are called "shaping and joining", and it's language-specific. At least in Win 9x, likely 2K, and probably XP, a DLL named Uniscribe (iirc! -- several versions exist; I'm using usp10.dll in 98 SE) takes care of these details for many major writing systems.

Concerning the dozen or so major writing systems of India, most otherwise-universal software that's been internationalized still can't render them properly. Microsoft had some business arrangement (perhaps ownership) with a company in India that is doing something serious about the situation, however. I'm surely no MS fan, but they have done a lot of good, imho, for i18n and international computer typography. Their Arial Unicode MS font (no longer available) was (imho) a tremendous boon to the computer community, for one. The Typography section of their Web pages has been laudably un-commercial. (While on the topic, my Libranet seems to render many writing systems quite nicely. I'm very pleased about that. I did import fonts from 98 SE.)

What it amounts to is that acceptable rendering of most of the world's writing systems requires writing-system-specific software. Furthermore, entering text in systems like Amharic (Ethiopian), Korean, or Chinese and Japanese, all of which have at least a few hundred rendered characters, using the keyboard we are accustomed to, requires what seems to be called "input method"* software that acts as an intermediary between the keyboard and the end application.

*MS term: Input Method Editor, IME

Just today, I had a look at Mandriva 2K6 Live, and was happily surprised to see the variety of keyboard layouts. (Hey! Dvoraks in Scandinavia! There's hope yet!) Can't say that without noting the install-language choices.
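Back to the BiDi and four-forms points for a moment: Unicode's character database actually records both, and Python's built-in unicodedata module lets you peek at them (a sketch of mine, not anything Uniscribe-specific):

    import unicodedata

    # Directionality classes, which feed the BiDi algorithm:
    print(unicodedata.bidirectional("A"))        # 'L'  -- left-to-right
    print(unicodedata.bidirectional("\u0628"))   # 'AL' -- Arabic Letter, right-to-left

    # The four contextual forms of one letter (BEH), kept in Unicode as
    # legacy "presentation forms"; a shaping engine like Uniscribe picks
    # the right one from context:
    for cp in range(0xFE8F, 0xFE93):
        print(hex(cp), unicodedata.name(chr(cp)))
    # 0xfe8f ARABIC LETTER BEH ISOLATED FORM
    # 0xfe90 ARABIC LETTER BEH FINAL FORM
    # 0xfe91 ARABIC LETTER BEH INITIAL FORM
    # 0xfe92 ARABIC LETTER BEH MEDIAL FORM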
Thought I knew the names of most significant languages; not so. (Some obscure ones are probably not "significant", but to their speakers, they are!)

A final thought: Thai does not use word spaces. Itstextisruntogetherlikethis. Afaik, such a simple matter as breaking lines of text properly* requires dictionary lookup!

*I've seen hyphenated line breaks in recently-composed English-language texts that make me want to lose my last meal...

AT LAST:

On Thu, 16 Mar 2006 14:17:16 -0500, Robert La Ferla <robertlaferla at comcast.net> wrote:

> BTW - When sending e-mail in Japanese, use ISO-2022-JP for your encoding
> to avoid complaints about mojibake.

Imho, read and heed! I didn't know that. I'm extremely unlikely to send e-mail in Japanese, but it's one of those essentials (like knowledge of BCC) one really has to keep in mind when sending e-mail.

As I understand it (and I might well be wrong! Corrections welcome!), there are at least two basically-different ways to encode Japanese text; one of them (ISO-2022-JP itself, in fact) is something like the old {ltrs}/{figs} shift in 5-bit teleprinters -- one can be in the wrong mode. The consequence is that if a "mode-change" character is omitted, or wrongly sent when it should not be (or munged...), all subsequent text (at least up to a redefining of the mode) is scrambled badly. If you think seeing English text in {figs} shift is bad: when you have a practical set of something like 2,300 or so basically-Chinese characters, and are receiving nonsense, as I understand it, that's mojibake.

[Katakana] One can read more Japanese than one might, at first, expect. Japan has imported English words "wholesale", sometimes adapting them to its own language (I'm typing on a Compaq "pasokon" -- pasonaru konpyuutaa). Perhaps 35,000 words have been imported. These words are rendered/written with a simple syllabary called katakana, which (except for arbitrary-seeming, never-complicated character shapes) is about as easy to learn as an alphabet, and can be a *lot* of fun. By no means whatsoever is katakana anywhere near as difficult to learn as kanji (the Japanese name for Chinese characters; kan = China, ji = character). As to that "arbitrary": katakana started life as sometimes-complicated Chinese characters, used purely for phonetic purposes. I could go on, and on...

Knowing katakana can be useful; if you learn it, or at least have a character chart*, you can read part of Japanese text. (Btw, in all, Japanese uses no fewer than four character sets in routine text, if you include romaji (roman/latin letters) as part of their writing system, which they are. Four is unique; it's the maximum of any writing system.)

*The character chart alone isn't quite sufficient, but it helps considerably. Try Tuttle, publisher, for a small book about katakana for people going to Japan.

(Just in case anyone checks, this is being sent via the Delicate Flower (skunk-cabbage flower?) OS, the one that never seems to run out of unimaginable ways to crash; however, Opera e-mail in that OS is *vastly* better for i18n than straight Libranet*! I'm currently using both. These days, at the beginning of a session, I am deciding which to boot first.)

*Very nice Debian derivative, but likely to fade from the scene.

My regards to all,

-- 
Nicholas Bodley [{(<>)}] Waltham, Mass.
Midnight hacker (approved by management) in 1960, on an all-NAND-gate
machine with a 19-bit word length; paper tape code was duotricenary
(radix-32).
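P.S. For the curious, a small Python 3 sketch of my own showing that mode-shift behavior; Python's codecs know ISO-2022-JP:

    text = "日本語"   # "nihongo", i.e., "Japanese [language]"

    jis = text.encode("iso-2022-jp")
    print(jis)   # b'\x1b$BF|K\\8l\x1b(B'
    # ESC $ B shifts into two-byte JIS X 0208 mode; ESC ( B shifts back
    # to ASCII. Lose or corrupt one of those escapes, and everything up
    # to the next mode change is read in the wrong mode: mojibake.

    # Decoding bytes with the wrong codec altogether gives much the same
    # effect; here, Shift-JIS bytes misread as EUC-JP:
    sjis = text.encode("shift_jis")
    print(sjis.decode("euc-jp", errors="replace"))   # scrambled nonsense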
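P.P.S. And that Thai word-breaking point: ICU's dictionary-based break iterator is the usual tool for it. A sketch assuming the third-party PyICU package is installed (the segmentation shown is ICU's dictionary at work, not mine):

    from icu import BreakIterator, Locale   # PyICU; assumed installed

    text = "สวัสดีครับ"   # a Thai greeting, written with no word spaces

    bi = BreakIterator.createWordInstance(Locale("th"))
    bi.setText(text)

    # Iterating a PyICU break iterator yields successive boundary
    # offsets; ICU finds them by dictionary lookup, as described above.
    boundaries = [0] + list(bi)
    print([text[a:b] for a, b in zip(boundaries, boundaries[1:])])
    # e.g. ['สวัสดี', 'ครับ']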