Dynamic code points, grapheme tables, and not getting your services denied.

One of the features of NFG I've mentioned before is that it solves our problems with variable-width characters without taking any additional storage space over UCS-4, on top of which it's defined. The artifact that lets us pull off that trick is called the 'grapheme table', and today I'll try to explain how it works.
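To make that concrete, here is a rough sketch in C of how such a representation might look. All of the names here (nfg_char, grapheme, grapheme_table, grapheme_lookup) are mine, invented for illustration; the real data structures will no doubt differ in the details:

    #include <stdint.h>
    #include <stddef.h>

    /* An NFG string is stored as 32-bit values, exactly like UCS-4.
     * Non-negative values are ordinary Unicode code points; negative
     * values are dynamically created grapheme IDs that index into the
     * string's grapheme table.  Same storage, fixed-width graphemes. */
    typedef int32_t nfg_char;

    typedef struct {
        uint32_t *codes;    /* the Unicode code points making up the grapheme */
        size_t    n_codes;  /* how many of them there are */
    } grapheme;

    typedef struct {
        grapheme *entries;   /* one entry per created code point */
        size_t    n_entries;
    } grapheme_table;

    /* Synthetic code point -1 maps to entry 0, -2 to entry 1, and so on. */
    static inline const grapheme *
    grapheme_lookup(const grapheme_table *t, nfg_char c)
    {
        return &t->entries[-c - 1];
    }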

I've mentioned before that when converting a string into NFG we dynamically create new code points for sequences of combining characters that do not map to a single Unicode code point. The information needed to turn such a new code point back into a stream of valid Unicode code points is stored in the grapheme table.
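Continuing the sketch above, creating those code points might look roughly like this. The function table_intern and its linear scan are my invention (a real implementation might well use a hash instead): a sequence already in the table gets its existing synthetic code point back, so identical graphemes share one entry, and only a genuinely new sequence appends to the table:

    #include <stdlib.h>
    #include <string.h>

    /* Find or create the synthetic code point for a combining sequence.
     * Uses the grapheme/grapheme_table types from the sketch above;
     * error handling is omitted for brevity. */
    static nfg_char
    table_intern(grapheme_table *t, const uint32_t *codes, size_t n)
    {
        /* Reuse an existing entry if this exact sequence was seen before. */
        for (size_t i = 0; i < t->n_entries; i++) {
            const grapheme *g = &t->entries[i];
            if (g->n_codes == n && memcmp(g->codes, codes, n * sizeof *codes) == 0)
                return (nfg_char)-(int32_t)(i + 1);
        }
        /* Otherwise append a new entry and hand out the next negative ID. */
        t->entries = realloc(t->entries, (t->n_entries + 1) * sizeof *t->entries);
        t->entries[t->n_entries].codes = malloc(n * sizeof *codes);
        memcpy(t->entries[t->n_entries].codes, codes, n * sizeof *codes);
        t->entries[t->n_entries].n_codes = n;
        t->n_entries++;
        return (nfg_char)-(int32_t)t->n_entries;
    }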
With one entry per created code point, and given the relative rarity of graphemes that lack a Unicode code point, this table is expected to stay quite small most of the time; you would have to feed it some rather odd input for it to grow beyond a hundred entries. All of this assumes, of course, that nobody is trying to break things intentionally.
A problematic assumption, isn't it? All it takes is one malicious input string and all of our NFG goodness goes poof! Fortunately, there are a few measures we can take to mitigate this problem. The easy way to avoid global resource exhaustion by malicious parties is, in our case, to de-globalize the resource: if each string has its own grapheme table, then one malicious string can't impact the others, and we still get all of our NFG goodness.
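As a sketch of what that de-globalization might look like (again, nfg_string is a hypothetical name, not the real structure), the string header simply grows a pointer to its own private table, so anything a hostile input can bloat dies together with the string that holds it:

    /* Each string owns its own grapheme table. */
    typedef struct {
        nfg_char       *data;       /* 32-bit NFG code points; negatives are synthetic */
        size_t          len;
        grapheme_table *graphemes;  /* this string's private table (NULL if unused) */
    } nfg_string;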
Sure, there's a cost to this approach: the string header gets one pointer larger, concatenation gets a bit more expensive since we need to merge the two grapheme tables, and we generally have to pay more attention to string operations that involve negative code points. But none of that is a big problem, and we should have enough room to be clever if the need to optimize arises. A small price to pay for not getting your services denied, if you ask me.
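For instance, the merge step in concatenation could work roughly as below, assuming the result starts out with a copy of the first string's table (so the first string's synthetic IDs stay valid) and a raw copy of both strings' data. Only the second string's negative code points need remapping, since the same grapheme may have a different ID in the merged table. The function concat_remap is hypothetical and reuses grapheme_lookup and table_intern from the sketches above:

    /* Remap the synthetic code points copied from string b into the
     * result's table; b_offset is where b's data begins in the result. */
    static void
    concat_remap(nfg_string *result, const nfg_string *b, size_t b_offset)
    {
        for (size_t i = 0; i < b->len; i++) {
            nfg_char c = b->data[i];
            if (c < 0) {  /* synthetic: only valid in b's table, so re-intern */
                const grapheme *g = grapheme_lookup(b->graphemes, c);
                c = table_intern(result->graphemes, g->codes, g->n_codes);
            }
            result->data[b_offset + i] = c;
        }
    }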
I guess we'll find out soon enough, as today is the official GSoC 'start coding' date, which means I'll start working on NFG proper, and we should all have some nice grapheme tables here by next week. That's the plan, anyway.

What about bad data?

Coming from the Perl 5 world, one of the most common jobs the language gets put to is dealing with log files. There are thousands of kinds of log files, and Perl has probably been asked to parse all of them. Sometimes these files contain snippets of data that come from ... somewhere. My favorite was probably a program connecting to the wrong port and convincing the server that you wanted to log in as a user whose name was a 12k chunk of uninitialized network-card buffer. I found that on an old EISA system running an early version of Linux.

But more routinely, these files tend to get oddly truncated, or a rogue pointer coughs up some line noise before a string. These lines are invalid, but the problem is that they're also important, and we need to read them as close to the way they were intended as possible. Any log parser that gives up when it sees bad data is going to have a very short lifespan.

Now, throw on top of that the fact that malicious users will want to construct inputs that compromise remote hosts reading data in such a dynamic way. How can we assure security-conscious coders that a maliciously crafted string won't cause our grapheme tables to explode?