What is NFG and why you want Parrot to have it.

The Grapheme Normal Form for Unicode (or NFG, as we like to call it) has been specified as a feature Parrot wants for a long time; it's been in the Parrot Design Document for strings since before I had a commit bit, or any involvement in the project at all. Something that has gone unimplemented for that long can't be that important, right? I mean, we have clearly survived without it. It turns out it is important, but it takes some background to see why.

Let's see what NFG is by looking at the problems it solves; hopefully that will explain why we want it at the same time. I'll describe the implementation later; for now I just want to make the case for its benefits without getting bogged down in implementation details.

So, let's say you have a string that, through no fault of your own, is in some form of Unicode. Okay, that's no problem; Parrot can handle that. Except, of course, that depending on the encoding there won't be any O(1) random access for you. That's simply the nature of things when your characters don't have a consistent size.
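To make the indexing problem concrete, here's a small sketch (Python, purely for illustration; Parrot's internals are C) showing that byte offsets and character positions diverge in UTF-8:

```python
# In UTF-8, characters occupy one to four bytes, so the byte at
# index i is generally not character i.
data = "héllo".encode("utf-8")

assert len("héllo") == 5   # five characters...
assert len(data) == 6      # ...but six bytes: 'é' takes two
assert data[1] == 0xC3     # byte 1 is half of 'é', not a character

# Finding character i therefore means scanning from the start:
# O(n), not O(1).
```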

This is hardly news; it first came up with UTF-8 right when it was invented. The solution adopted by the people who invented UTF-8, on the operating system they invented it for (Plan 9), was to avoid the variable-width encoding in string processing as much as possible. Any program that was expected to run faster than molasses, starting with the regex libraries, was adapted to use 'Runes': fixed-width characters that trade memory efficiency for O(1) random access to strings. A fine compromise, if you ask me.
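The Rune trade-off is easy to demonstrate. In this sketch (again Python, for illustration only), fixed-width UTF-32 storage makes looking up character i a simple offset calculation:

```python
import struct

# Fixed-width storage: UTF-32 spends four bytes per codepoint,
# so character i always lives at byte offset 4 * i.
s = "héllo"
runes = s.encode("utf-32-le")

assert len(runes) == 4 * len(s)                 # the memory cost
cp = struct.unpack_from("<I", runes, 4 * 1)[0]  # O(1) lookup
assert chr(cp) == "é"                           # character 1, directly
```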

Of course, once you've solved that, you have to face the next problem. Even if you bite the bullet and pay the storage cost for O(1) access, you still get bitten by the fact that encoding "all the scripts in the world" is an ugly business with more edge cases than you thought were even possible.

Your next problem is 'combining characters', and boy are they fun. Let's say you want to do something really easy, something that could not possibly go wrong, like compare two strings for equality. Yes, that's easy, and efficient, so long as your system's memcmp() knows that the sequence "{A, tilde, acute, dot_below}" must compare equal to "{A, tilde, dot_below, acute}" and to "{A-tilde-acute, dot_below}", which has a different length. Oh, and to "{A-tilde-acute-dot_below}" too, and a few more, but you get the idea.
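You can watch this bite in a couple of lines of Python (illustrative only): two spellings of the same grapheme compare unequal under a naive, memcmp-style comparison.

```python
import unicodedata

s1 = "\u00c1"    # 'Á' as a single precomposed codepoint
s2 = "A\u0301"   # 'A' followed by a combining acute accent

# The same visible character, yet a naive comparison disagrees.
assert s1 != s2
assert unicodedata.name(s2[1]) == "COMBINING ACUTE ACCENT"
```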

There is a way out of this mess, kind of. The Unicode standard defines several 'Normal Forms' that you can put your strings into. For example, you can say "I want all of my strings to be fully composed at all times," or maybe you want them fully decomposed; there's a form for that too. As long as you make sure all of your strings are properly normalized, you get to compare for equality the easy way.
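Python's standard library, for instance, exposes these normal forms through `unicodedata.normalize`; once both strings are pushed into the same form, plain equality works:

```python
import unicodedata

s1 = "\u00c1"    # precomposed 'Á'
s2 = "A\u0301"   # decomposed 'Á'
assert s1 != s2  # the raw comparison still disagrees

# Normalize both sides to the same form, then compare the easy way.
assert unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
assert unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)
```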

Except that even the "most composed" form Unicode gives you can still leave loose combining characters in your string. It sounds like an unlikely scenario, true, but it does happen, and you have to handle it if you want your handling of some languages to be correct. So now you have to handle a bunch of special cases on top of the price you are already paying for normalization. I told you it was fun.
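A quick way to see those leftover marks: 'q' with a tilde has no precomposed codepoint, so even NFC, the fully composed form, leaves the combining character loose (Python, for illustration):

```python
import unicodedata

s = "q\u0303"    # 'q' + COMBINING TILDE: one grapheme, two codepoints
nfc = unicodedata.normalize("NFC", s)

assert len(nfc) == 2                     # NFC has nothing to compose it into
assert nfc[0] == "q"                     # nfc[1] is still a loose mark
assert unicodedata.combining(nfc[1]) != 0
```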

Fortunately, there is a solution for that too. It's called NFG, and it works by taking normalization one step further: we create a new normal form that dictates that every combining character gets composed, no exceptions. If the resulting character doesn't have a codepoint, we assign it one dynamically.
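Here's a toy model of that idea, in Python rather than Parrot's C, with all names (`GraphemeTable`, `to_nfg`) invented for illustration: compose what Unicode can, then hand each remaining multi-codepoint cluster a synthetic codepoint above U+10FFFF. (The cluster split below is deliberately simplified: it starts a new cluster at every character with combining class zero, which ignores some of Unicode's real grapheme-boundary rules.)

```python
import unicodedata

class GraphemeTable:
    """Maps grapheme clusters to (possibly synthetic) codepoints."""

    def __init__(self):
        self._table = {}        # cluster string -> synthetic codepoint
        self._next = 0x110000   # first value past Unicode's range

    def codepoint_for(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)               # a real Unicode codepoint
        if cluster not in self._table:        # unseen composite grapheme:
            self._table[cluster] = self._next # assign one dynamically
            self._next += 1
        return self._table[cluster]

def to_nfg(table, s):
    """One int per grapheme: O(1) indexing, memcmp-style equality."""
    s = unicodedata.normalize("NFC", s)   # compose what Unicode can
    clusters, current = [], ""
    for ch in s:
        if current and unicodedata.combining(ch) == 0:
            clusters.append(current)      # base character: new cluster
            current = ""
        current += ch
    if current:
        clusters.append(current)
    return [table.codepoint_for(c) for c in clusters]

table = GraphemeTable()
# Both spellings of 'Á' collapse to the same single codepoint...
assert to_nfg(table, "A\u0301") == to_nfg(table, "\u00c1")
# ...and 'q' + tilde, which NFC can't compose, gets a synthetic one.
assert to_nfg(table, "q\u0303") == [0x110000]
```

With strings in this form, equality is a plain element-by-element comparison and indexing is a direct array lookup, which is the whole point.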

That's what NFG is about: coercing Unicode into a form that, though a bit costly to set up, allows more efficient string handling through simpler code.