PLEX86  x86- Virtual Machine (VM) Program

Thou shalt have no other gods before the ANSI C standard 1354


Steve O'Hara-Smith

Oh ... it would take me a while to wade through the stuff again. Let me just recall the issue for you. Basically, Unicode cannot handle interoperable localization of Chinese, Korean and Japanese. If you know Unicode, then you'll know that Chinese, Korean and Japanese characters were merged into common code points called "Han", or specifically UniHan. This is all fine and well as far as representation and efficiency go, because it turns out that those languages share many code points. The problem comes from the fact that those languages do not share *representations* of those characters. That is to say, you are supposed to write them differently depending on what language you are writing in. So, a Chinese person can use Unicode for just Chinese, so long as they are using a Chinese font. But as soon as that person wants to incorporate some Japanese or Korean text, they need to *switch* fonts. This is a unique problem for CJK. All other language combinations allow for one language to be written alongside another language using a single font.
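To make the point about shared code points concrete, here is a minimal Python sketch. It assumes U+76F4 as the example character (it is often cited as a unified ideograph that is drawn differently in Chinese and Japanese typography; that choice is mine, not from the discussion above). The helper name describe is hypothetical. All it shows is that the encoded data carries one language-neutral code point and nothing about which glyph to draw.

```python
import unicodedata

# Assumption for illustration: U+76F4 is a unified Han ideograph whose
# preferred glyph shape differs between Chinese and Japanese fonts.
HAN_EXAMPLE = "\u76f4"

def describe(text):
    """Return (code point, Unicode character name) for each character."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text]

info = describe(HAN_EXAMPLE)
encoded = HAN_EXAMPLE.encode("utf-8")

# The name is language-neutral ("CJK UNIFIED IDEOGRAPH-..."), and the
# bytes are identical whether the writer meant Chinese or Japanese.
print(info)
print(encoded)
```

Nothing in info or encoded says which language the writer intended, so the choice of glyph has to come from somewhere outside the text itself.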

I don't know Chinese, Japanese or Korean, so I personally cannot give a natural example. However, let me give you an example of something that *does* work:

My French friend Jacques said: "Marecelo m'a montré une image appelée


That's three languages in one sentence, and there is no ambiguity. If you know the languages, then you can read it with no troubles -- obviously no font switch is required, and there's no problem encoding that in Unicode. No meaning is lost. Now try doing that with Chinese, Japanese and Korean. I am sure there might be many sentence combinations that work, but because the pictograms for the same character *look* different in the different languages, there are at least *SOME* combinations that don't work as intended, and cannot be fixed in a context-independent way.


That is to say, for simple disambiguation between Chinese, Japanese and Korean all in the same stream of data you need some kind of meta-information, such as font information. This is unique to those three languages within Unicode, and there is no specifically documented Unicode method for encoding font or language selection meta-data (obviously, Unicode is mandated *NOT* to encode text in that way). So while using just Unicode for any other combination of languages in a single data stream presents no problem whatsoever, CJK are screwed. Your only recourse under Unicode is to resort to using the "private data areas" to encode some application-specific encoding to deal with this disambiguation.
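As a rough illustration of what "some kind of meta information" might look like when it is carried outside the text, here is a small Python sketch. Everything in it is hypothetical: the Run type, the language tags, the font names and the fonts_needed helper are invented for the example and are not part of Unicode or of any real library. It again assumes U+76F4 as a character whose glyph differs by language.

```python
from dataclasses import dataclass

@dataclass
class Run:
    lang: str  # hypothetical language tag: "zh", "ja" or "ko"
    text: str  # plain Unicode text, with no language information inside

# Hypothetical font table; the font names are placeholders, not real fonts.
FONT_FOR_LANG = {
    "zh": "ExampleChineseFont",
    "ja": "ExampleJapaneseFont",
    "ko": "ExampleKoreanFont",
}

def fonts_needed(runs):
    """Pair each run's text with the font its language tag selects."""
    return [(FONT_FOR_LANG.get(r.lang, "ExampleDefaultFont"), r.text) for r in runs]

# The same code point, once tagged as Chinese and once as Japanese.
runs = [Run("zh", "\u76f4"), Run("ja", "\u76f4")]

# Flattening to plain Unicode discards the tags: both characters are identical.
plain = "".join(r.text for r in runs)

print(fonts_needed(runs))
print(plain[0] == plain[1])
```

Once the runs are flattened into plain, the information needed to pick a font is gone, which is the disambiguation problem described above.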

The Unicode people themselves were confronted with this issue, and rather than acknowledging it as an action item (Unicode is still a standard with ongoing activity) they give excuses like saying "Unicode is not meant to do that. It just gives the raw encodings, how you are supposed to read it is up to developers." There are links to these statements on the Unicode site somewhere, but I don't have them at my fingertips.

But you can see the problem, right? I am a programmer, and would like to be able to write programs that handle *ANY* text, even if I personally can't understand the contents of that text. But to properly develop programs that do such a thing, what I want is at least one *SINGLE* font in which I can at least render *ANY* Unicode data. I can only do so if, a priori, I choose to limit myself to *ONE* of the languages of Chinese, Japanese and Korean. You see? I don't even speak any of those languages, and it's a problem for me.

-- Paul Hsieh



Alt Folklore Computers Newsgroups
