The extra pointer has to go.

I reached the midterms mostly on schedule, and NFG is pretty much feature complete now. There's still some stuff to do here and there, but the 'big ticket' items are done. So I have been looking at what needs to be done before the gsoc_nfg branch can be merged back into trunk. That means taking a hard look at all of the places where I've cut corners and seeing if they can be made better. It's mostly minor stuff, like leaving out a cast, or not paying attention to const mismatches in a few places. Most of it is just a matter of code cleanup. Until you get to the extra pointer I added to string headers.

The reason behind that pointer is that I needed a place to hang the grapheme table from. It's completely unused outside of the NFG Unicode encoding, and it has caused a few problems in the code. A few string pointers aren't const anymore, because we have to adjust the 'extra' pointer if the grapheme table is reallocated. I had to add some hooks into the gc to properly dispose of the table when it collects a string header. The biggest problem, however, is memory usage. Every string header is now one pointer larger, and that pointer goes unused in most cases. Parrot's trunk is already too memory hungry as it stands, and I don't really feel comfortable merging back the NFG functionality until I can work that point out. Fortunately, I see a way out of this, and if I'm right, I can make Parrot's string headers smaller than they are on trunk. How's that for an improvement?
The one caveat for this strategy is that it implies disabling Parrot's COW (copy-on-write), which sounds worse than it really is. For starters, ever since we moved to immutable strings we've been doing a lot less with COW than we used to, and I've had conversations with bacek about some pathological behaviors where COW actually hurts us. The upside, once COW is gone, is that we can do away with the bufstart/strstart distinction, so the header becomes smaller, saving memory and improving performance. We also free up the space where we keep the buffer refcount (which sounds like a nice place to stash our grapheme table, doesn't it?) and maybe we can simplify the gc's handling of strings a bit.
Of course, as I mentioned before I'm not comfortable with merging before I can convince myself that the branch is a clear improvement over trunk, so there's quite a bit of benchmarking ahead.
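A rough sketch of the header change described above. All the names here (trunk_header, branch_header, slim_header, buffer_prefix and the helper functions) are made up for illustration and are not Parrot's real definitions. It also assumes the refcount lives in a word just in front of the buffer data, which is the slot the grapheme table pointer would take over:

```c
#include <stddef.h>
#include <stdlib.h>

/* Placeholder table type; the real NFG table layout is not shown here. */
typedef struct grapheme_table { size_t used; } grapheme_table;

/* Assumed trunk-like header: COW needs both the buffer start and the string start. */
typedef struct trunk_header {
    void   *bufstart;
    char   *strstart;
    size_t  buflen;
    size_t  bufused;
} trunk_header;

/* Assumed branch-like header: the same fields plus the extra pointer. */
typedef struct branch_header {
    void           *bufstart;
    char           *strstart;
    size_t          buflen;
    size_t          bufused;
    grapheme_table *extra;     /* only ever set for NFG strings */
} branch_header;

/* Assumed header without COW: a single pointer into the buffer is enough. */
typedef struct slim_header {
    char   *strstart;
    size_t  buflen;
    size_t  bufused;
} slim_header;

/* Assumed: the word in front of the data, formerly the refcount, holds the table. */
typedef struct buffer_prefix {
    grapheme_table *graphemes; /* NULL when the string needs no table */
} buffer_prefix;

/* Allocate len bytes of string data preceded by the prefix word. */
static char *alloc_buffer(size_t len, grapheme_table *table)
{
    buffer_prefix *p = malloc(sizeof *p + len);
    if (p == NULL)
        return NULL;
    p->graphemes = table;
    return (char *)(p + 1);
}

/* Step back one prefix word from the data to find the table pointer. */
static grapheme_table *buffer_graphemes(const char *strstart)
{
    return ((const buffer_prefix *)(const void *)strstart - 1)->graphemes;
}

/* Release a buffer obtained from alloc_buffer (the table is not freed here). */
static void free_buffer(char *strstart)
{
    if (strstart != NULL)
        free((buffer_prefix *)(void *)strstart - 1);
}
```

With layouts like these, sizeof(slim_header) comes out smaller than sizeof(trunk_header), which in turn is smaller than sizeof(branch_header). Whether the real headers shrink by the same margin is exactly what the benchmarking has to show.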

const grapheme table?

With COW removed and strings immutable, doesn't that solve the problem of reallocating the grapheme table with const strings? If not, perhaps you should use a handle instead of a pointer and suffer the extra lookup?

Another thing I have not grasped: how shared is the grapheme table? Is it 1) per interpreter, 2) per string or 3) somewhere in between? If not per string, is there some trick where the grapheme tables are held in a global indexed hash? Or perhaps strings that don't need the grapheme table are allocated from a pool that has a null grapheme table pointer?

Re: const grapheme table?

To your first question:
The particular 'constness' problem I was referring to was one I ran into with iterators. Even though strings are immutable once created and the code is smart enough to create the table in a single allocation at string creation time, you can still create a string by simply using an iterator and 'encode_and_advance'-ing codepoints into it. That way you can't tell in advance how much space to preallocate in the table, or even if there will be a table at all in the end.
In that case we can still have a table reallocation, which isn't really a big deal, except for the fact that our iterators have a "const STRING *" member, which kind of makes it hard to reallocate the table if it's stored in the string header. I considered the handle approach, but I *really* think we should make our string headers smaller, so I decided to punt by provisionally removing the 'const' until I could move the pointer off the header. Which shouldn't be too hard to do, if I've been doing my work properly :).
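To make the iterator problem concrete, here is a minimal sketch with hypothetical types (nfg_string, nfg_iter and iter_add_entry are stand-ins, not Parrot's STRING, String_iter or 'encode_and_advance'). The table entries are plain 32-bit placeholders rather than the real table contents. The point is only that growing the table can move it, so the pointer in the header has to be rewritten:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder table, kept in one allocation (flexible array member, C99). */
typedef struct grapheme_table {
    size_t   used;
    size_t   size;
    uint32_t entries[];
} grapheme_table;

/* Stand-in for the string header with the 'extra' pointer. */
typedef struct nfg_string {
    grapheme_table *extra;  /* NULL until the first entry is needed */
} nfg_string;

/* Stand-in iterator; with "const nfg_string *str" the store below would not compile. */
typedef struct nfg_iter {
    nfg_string *str;
} nfg_iter;

/* Append one entry through the iterator; returns 0 on success, -1 if out of memory. */
static int iter_add_entry(nfg_iter *it, uint32_t entry)
{
    nfg_string     *s = it->str;
    grapheme_table *t = s->extra;

    if (t == NULL || t->used == t->size) {
        int             fresh    = (t == NULL);
        size_t          new_size = fresh ? 4 : t->size * 2;
        grapheme_table *grown    = realloc(t, sizeof *grown
                                              + new_size * sizeof grown->entries[0]);
        if (grown == NULL)
            return -1;              /* the old table, if any, is still valid */
        if (fresh)
            grown->used = 0;
        grown->size = new_size;
        s->extra    = grown;        /* the header is modified here */
        t           = grown;
    }
    t->entries[t->used++] = entry;
    return 0;
}
```

Since the number of entries isn't known up front, the table starts out NULL and doubles as needed. A string that never needs an entry never allocates a table at all.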
To the second question:
The grapheme table is per-string, and created as needed. Some NFG strings are just UCS-4 strings in disguise, for example, and in those cases the grapheme table is NULL, same as for a regular string. There are no special pools involved, nor do I feel there should be, given that from a user's standpoint, NFG is just another string encoding.