Encodings, charsets and how NFG fits in there.

The post from last week talked about what NFG was and tried to explain why it is a good feature for Parrot to have. Today I'll be slightly more concrete and talk a bit about how NFG fits inside the Parrot string structure. There are other parts of Parrot that will need hacking on, but this time I want to limit myself to the two bottommost pointers in the STRING structure definition and the concepts behind them.

I'm referring to "encoding" and "charset" and they deal, oddly enough, with encodings and charsets. As usual, there is a twist here. Parrot's notion of 'charset' and 'encoding' might not exactly be what you expect from the name. The canonical definition is contained in Parrot Design Document number 28, "Strings". I sleep with that under my pillow.
The 'charset' deals with the set of characters that is used in the string; it can be one of ASCII, ISO-8859-1, Unicode or binary[1]. In Unicode terms it is a character repertoire: a collection of characters, a well-defined subset of the Unicode character space.
It is the charset's job to know everything there is to know about the characters that make up the string: composition, decomposition, case changes, character classes and whether a given codepoint is a member of the character set. It also knows how to convert strings to and from other charsets, but charsets are all blissfully ignorant of one small detail: the representation of characters inside the string.
That's the part that the 'encoding' is expected to handle. The encodings available to Parrot today are: fixed_8, utf8, ucs2 and utf16. At first glance you might think "Hey! Four charsets and four encodings? I bet you can pair them and save a pointer." But you can't. If you look closer, utf8, utf16 and ucs2 are all there for Unicode. That multiply-encoded scoundrel!
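To see why the two pointers can't be merged, here is a small sketch in Python rather than Parrot's C. Python's codec names ("utf-8", "utf-16-le", "utf-32-le") stand in for Parrot's encodings purely as an analogy; they are not Parrot's API. The string holds the same characters throughout, and only the bytes in memory change.

```python
# One set of characters (Unicode), several in-memory representations.
# The codec names are Python's, used only as an analogy for Parrot's encodings.
text = "na\u00efve"  # "naive" with a diaeresis on the i: five characters

utf8 = text.encode("utf-8")        # the accented letter takes two bytes
utf16 = text.encode("utf-16-le")   # two bytes per character here
utf32 = text.encode("utf-32-le")   # four bytes per character

for name, raw in (("utf-8", utf8), ("utf-16-le", utf16), ("utf-32-le", utf32)):
    print(name, len(raw), "bytes")

# Character-level operations give the same answer whichever bytes we started from.
decoded = [
    utf8.decode("utf-8"),
    utf16.decode("utf-16-le"),
    utf32.decode("utf-32-le"),
]
upper = [s.upper() for s in decoded]
```

The case change at the end is the kind of work the charset does, and it never needs to know which of the three byte layouts was used.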
Summing up, the encoding encapsulates away all of the nasty details of representing characters in memory, so that the charset can pretend they don't exist and happily do all of its character manipulations without worries. So, how does NFG fit here? Is it a charset? Is it an encoding? It is a bit of both.
First of all, it is a Normal Form on top of Unicode: it defines a new representation for Unicode strings with weird[2] compositions in them. That forces us to create a new encoding. And it could very well end there. New normal form, new encoding and all is good.
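A quick way to see what a weird composition looks like is to ask a Unicode library to compose a few sequences. This is only an illustration using Python's standard unicodedata module, not Parrot code, and the two sample sequences are my own picks.

```python
import unicodedata

# "e" followed by COMBINING ACUTE ACCENT: Unicode has a precomposed codepoint for it.
ordinary = unicodedata.normalize("NFC", "e\u0301")

# "q" followed by COMBINING DOT BELOW: no precomposed codepoint exists,
# so even the composed normal form keeps two codepoints.
weird = unicodedata.normalize("NFC", "q\u0323")

print(len(ordinary), [hex(ord(c)) for c in ordinary])
print(len(weird), [hex(ord(c)) for c in weird])
```

The second string is one character to a human reader but two codepoints to Unicode, and that gap is what NFG sets out to close.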
But it also might not end there. We are, in effect, defining new characters that are outside of the Unicode character space, and that means charsets. Fortunately, as NFG is built on top of Unicode, it could be handled by extending the current Unicode charset code. But the point stands: NFG gives us a bigger character set, and a new encoding to go with it. I know it doesn't sound very useful when said like that, but it is. I'll try to cover that in my next post, hopefully next week.
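For the curious, here is a rough sketch of what "new characters outside of the Unicode character space" could look like. Everything in it is an assumption made up for illustration: the GraphemeTable class, its intern and lookup methods, and the choice to number synthetic characters upward from just past U+10FFFF. It is not how Parrot does or will do it.

```python
import unicodedata

UNICODE_MAX = 0x10FFFF  # highest codepoint Unicode defines


class GraphemeTable:
    """Hypothetical table that gives each weird composition a single id."""

    def __init__(self):
        self._ids = {}       # composed cluster -> synthetic id
        self._clusters = []  # synthetic id offset -> composed cluster

    def intern(self, cluster):
        """Return one integer for the cluster, inventing an id if needed."""
        composed = unicodedata.normalize("NFC", cluster)
        if len(composed) == 1:
            # Unicode already has a codepoint for it; nothing to invent.
            return ord(composed)
        if composed not in self._ids:
            # Assumption: synthetic ids start right after the Unicode range.
            self._ids[composed] = UNICODE_MAX + 1 + len(self._clusters)
            self._clusters.append(composed)
        return self._ids[composed]

    def lookup(self, ident):
        """Map an id back to the codepoints it stands for."""
        if ident <= UNICODE_MAX:
            return chr(ident)
        return self._clusters[ident - UNICODE_MAX - 1]


table = GraphemeTable()
e_acute = table.intern("e\u0301")  # ordinary: becomes the existing codepoint
q_dot = table.intern("q\u0323")    # weird: gets an id outside Unicode
```

With something like this, every character a reader sees is exactly one integer in the string, which is the "bigger character set" half of the story; storing those integers is the "new encoding" half.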

[1] Technically, binary isn't really a charset but rather the absence of one; it means "I'm just a stream of bytes, no characters here, please move along now."
[2] Where weird means there is no Unicode codepoint for the composition of those two (or more) characters.