UCS-4, NFG and how the grapheme table makes it awesome.

After laying down the foundations of what NFG does in previous blog posts, I've started implementing a new Unicode encoding for Parrot, UCS-4, as part of my work in this Summer of Code. In this post I'll try to explain what it is and how it makes NFG easier to achieve.

UCS-4 was defined in the original ISO 10646 standard as a 31-bit encoding form in which each encoded character in the Universal Character Set (that's the UCS in the name, for those wondering) is represented by a 32-bit integer between 0x00000000 and 0x7FFFFFFF. It's tremendously wasteful of memory, but still useful, since now you can fit any weird character you can think of in one 'position' and go back to doing O(1) random access into your strings. Mostly.
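To make the fixed-width point concrete, here is a minimal sketch in Python. It is not Parrot's implementation; the helper names `encode_ucs4` and `codepoint_at` are made up for illustration, and little-endian byte order is an arbitrary assumption. The idea is just that codepoint number i always lives at byte offset 4 * i, so no scanning is needed.

```python
import struct

def encode_ucs4(text):
    """Pack each codepoint into 4 little-endian bytes (hypothetical helper)."""
    return b"".join(struct.pack("<I", ord(ch)) for ch in text)

def codepoint_at(buf, index):
    """O(1) lookup: codepoint number `index` starts at byte offset 4 * index."""
    (cp,) = struct.unpack_from("<I", buf, 4 * index)
    return cp

text = "a\u00f1\U0001F600z"      # 'a', 'ñ', an emoji outside the BMP, 'z'
buf = encode_ucs4(text)

print(len(buf))                   # 16 bytes: 4 codepoints * 4 bytes each
print(len(text.encode("utf-8")))  # 8 bytes: UTF-8 is smaller but variable-width
print(hex(codepoint_at(buf, 2)))  # 0x1f600, found without scanning the string
```

The same four characters take half the space in UTF-8, which is the "wasteful" part, but finding the third one there means walking over the first two.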

On paper, UCS-4 looks like it was meant to solve all of our problems, but it doesn't. It solves a few of the problems that I've brought up in previous posts, like fixed-width encoding of codepoints, but it still leaves some unfinished business. The biggest issue is one I've brought up before too: combining characters. Sure, if you make sure all of your Unicode strings are properly normalized to a composed form, you won't have to worry about 'LATIN CAPITAL LETTER N' + 'COMBINING TILDE' screwing up your indexing, since that will get composed into a single 'LATIN CAPITAL LETTER N WITH TILDE'. But what happens when some clever guy decides to put a 'heavy metal umlaut' on the letter 'N'? 'LATIN CAPITAL LETTER N' + 'COMBINING DIAERESIS' does not compose into any valid codepoint, and that means that we can't really assume that one codepoint == one symbol anymore. And this is the place where NFG comes in.
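The two cases above are easy to check with Python's standard `unicodedata` module. This is only an illustration of what normalization does and doesn't do, using NFC as the composed form; the variable names are mine.

```python
import unicodedata

n_tilde = "N\u0303"    # LATIN CAPITAL LETTER N + COMBINING TILDE
n_umlaut = "N\u0308"   # LATIN CAPITAL LETTER N + COMBINING DIAERESIS

composed_tilde = unicodedata.normalize("NFC", n_tilde)
composed_umlaut = unicodedata.normalize("NFC", n_umlaut)

print(len(composed_tilde))    # 1: composed into U+00D1
print(len(composed_umlaut))   # 2: no precomposed codepoint exists, nothing changes
```

So even a fully normalized string can still have one visible symbol spread over two codepoints.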

I left the details out of my previous posts in the hope that explaining the problem first would make it simpler to explain the solution later, but it seems I can't put it off any longer. NFG deals with the combining character problem by guaranteeing that all possible combinations will have a unique codepoint. No, not by using even wider characters. What NFG does is use the leftover space in UCS-4: the 32-bit values above 0x7FFFFFFF that the encoding never assigns.

Coincidentally, all of the 'leftover' codepoints are negative if you treat them as 32-bit signed integers, which is very useful when we want to tell them apart from 'regular' codepoints, even if it messes up our sorting order. The next problem is that the space of unhandled compositions is, as experts like to say, potentially unbounded, so we settle for the pragmatic approach of handling them as they come. Whenever we encounter a combining sequence that Unicode has no single codepoint for, we create a new entry in the string's grapheme table and replace the whole substring with the newly determined codepoint, so it only takes up one 'slot' and the last unaddressed problem in UCS-4 goes "POOF!"
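Here is a rough sketch of that idea in Python, under some loud assumptions: `GraphemeTable`, `to_nfg` and `as_unsigned` are hypothetical names, not Parrot's API; the table hands out ids -1, -2, ... in order of first appearance; and a "grapheme" is approximated as a base character followed by characters with a nonzero canonical combining class, which is simpler than real grapheme segmentation.

```python
import unicodedata

class GraphemeTable:
    """Maps codepoint sequences to negative 'codepoints' (hypothetical sketch)."""

    def __init__(self):
        self._by_sequence = {}   # tuple of codepoints -> negative id
        self._sequences = []     # position (-id - 1) -> tuple of codepoints

    def intern(self, sequence):
        """Return the negative id for a sequence, creating an entry if needed."""
        key = tuple(sequence)
        if key not in self._by_sequence:
            self._sequences.append(key)
            self._by_sequence[key] = -len(self._sequences)
        return self._by_sequence[key]

    def lookup(self, code):
        """Recover the original codepoint sequence from a negative id."""
        return self._sequences[-code - 1]

def to_nfg(text, table):
    """Compose what Unicode can, then give each leftover cluster one slot."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch):
            clusters[-1].append(ord(ch))   # attach the mark to the previous base
        else:
            clusters.append([ord(ch)])     # start a new cluster
    return [c[0] if len(c) == 1 else table.intern(c) for c in clusters]

def as_unsigned(code):
    """Show where a signed 32-bit value sits in the unsigned 32-bit range."""
    return code & 0xFFFFFFFF

table = GraphemeTable()
slots = to_nfg("N\u0303N\u0308!", table)

print(slots)                  # [209, -1, 33]
print(table.lookup(-1))       # (78, 776), i.e. N + COMBINING DIAERESIS
print(hex(as_unsigned(-1)))   # 0xffffffff, above the UCS-4 limit of 0x7fffffff
```

The N with tilde becomes the ordinary codepoint 209 (U+00D1), while the N with the heavy metal umlaut gets the table entry -1, so the whole string is three slots long and indexing by slot is O(1) again.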

Now that we've covered the design and rationale of NFG from a Unicode standpoint, it's time to start looking at it from a Parrot standpoint. But I'll leave that for my next post.