NFG is (somewhere in the vicinity of...) here

"N̈" (or "n̈" if you don't like caps) is a grapheme from several minor extended Latin alphabets. It occurs in the orthographies of the Jacaltec Mayan dialect, Cape Verdean Creole, and in the rockumentary "This is Spın̈al Tap". Today I want to talk about the injustices this symbol has faced in the past and how, starting today, Parrot can right them.

I am of the opinion that the movie, sorry, rockumentary, alone should be enough to earn this little grapheme a spot in most character sets. But life has never been easy for poor "N-diaeresis". He is not represented on computer keyboards (outside of ancient and possibly non-functional Mayan spaceships), nor is he available as an HTML entity. He even lacks a proper Unicode code point to call his own: his identity has been fragmented into the two-code-point sequence of 'LATIN CAPITAL LETTER N' and 'COMBINING DIAERESIS'.
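To make the fragmentation concrete, here is a small sketch in Python (not PIR, since this only concerns Unicode itself and not Parrot). It uses the standard unicodedata module; the helper name nfc_code_points is made up for this illustration. NFC composition folds "a" plus the combining diaeresis into a single code point, but has nothing to fold N-diaeresis into.

```python
import unicodedata

def nfc_code_points(text):
    """Return the names of the code points left after NFC normalization.

    Hypothetical helper, only for this illustration.
    """
    composed = unicodedata.normalize("NFC", text)
    return [unicodedata.name(ch) for ch in composed]

# "a" + COMBINING DIAERESIS has a precomposed form, so NFC yields one code point.
a_diaeresis = nfc_code_points("a\u0308")

# "N" + COMBINING DIAERESIS has no precomposed form, so NFC leaves both in place.
n_diaeresis = nfc_code_points("N\u0308")

print(a_diaeresis)  # ['LATIN SMALL LETTER A WITH DIAERESIS']
print(n_diaeresis)  # ['LATIN CAPITAL LETTER N', 'COMBINING DIAERESIS']
```

So even after the best normalization Unicode offers, anything that counts code points still sees two "characters" where a reader sees one.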

Today, as part of my work for the 2010 Google Summer of Code, I've done my part to right this wrong. Look at the following PIR code:

.include "stringinfo.pasm"

.sub 'main' :main
    # "a" and "n", each followed by U+0308 COMBINING DIAERESIS
    $S0 = unicode:"a\u0308n\u0308"

    # Report buffer size, bytes used and string length before transcoding
    print "STRINGINFO_BUFLEN\t"
    $I0 = stringinfo $S0, .STRINGINFO_BUFLEN
    say $I0
    print "STRINGINFO_BUFUSED\t"
    $I0 = stringinfo $S0, .STRINGINFO_BUFUSED
    say $I0
    print "STRINGINFO_STRLEN\t"
    $I0 = stringinfo $S0, .STRINGINFO_STRLEN
    say $I0

    print "\n"

    # Transcode the same string to NFG and report the same three values
    $I0 = find_encoding 'nfg'
    $S0 = trans_encoding $S0, $I0
    print "STRINGINFO_BUFLEN\t"
    $I0 = stringinfo $S0, .STRINGINFO_BUFLEN
    say $I0
    print "STRINGINFO_BUFUSED\t"
    $I0 = stringinfo $S0, .STRINGINFO_BUFUSED
    say $I0
    print "STRINGINFO_STRLEN\t"
    $I0 = stringinfo $S0, .STRINGINFO_STRLEN
    say $I0
.end

In the gsoc_nfg branch this outputs:

STRINGINFO_BUFLEN 56
STRINGINFO_BUFUSED 6
STRINGINFO_STRLEN 4

STRINGINFO_BUFLEN 16
STRINGINFO_BUFUSED 8
STRINGINFO_STRLEN 2

See the value for STRINGINFO_STRLEN at the bottom of the NFG version? It means there are only two graphemes there. They are "ä" and "n̈", two first-class citizens of the new NFG world. On the less bright side of things, there are also segfaults lurking in there and a big wad of crappy and inefficient code waiting to be cleaned up.
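The BUFUSED and STRLEN figures in the output above can be reproduced with a bit of arithmetic. The Python sketch below is only a cross-check, not Parrot code, and it rests on two assumptions: that the original string is stored as UTF-8, and that an NFG string spends a fixed 4 bytes on each grapheme. Both are consistent with the numbers printed, but neither is stated outright here. Counting graphemes as "code points that are not combining marks" is also a simplification that happens to work for this string.

```python
import unicodedata

text = "a\u0308n\u0308"

# Before transcoding (assumption: UTF-8 storage).
# "a" and "n" take 1 byte each, U+0308 takes 2 bytes each: 1 + 2 + 1 + 2.
utf8_bytes = len(text.encode("utf-8"))
code_points = len(text)

# After transcoding: one cell per grapheme. Simplification: every code point
# that is not a combining mark starts a new grapheme.
graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)

# Assumption: each NFG cell is a fixed-width 32-bit value.
BYTES_PER_CELL = 4
nfg_bytes = graphemes * BYTES_PER_CELL

print(utf8_bytes, code_points)  # 6 4
print(nfg_bytes, graphemes)     # 8 2
```

Under those assumptions the NFG string uses slightly more bytes than the UTF-8 one, in exchange for a length that matches what a reader would count.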

Still, we have a grapheme table and we can (kind of) transcode strings to NFG with NFC as a stepping stone. It is, so far, a straightforward table, with a bit of hashing thrown in, but if our expectations hold up to actual usage we might be able to get away with not doing anything more complicated. At least, I hope so.
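For readers who want a picture of what such a table might look like, here is a rough sketch in Python. It is not the code in the gsoc_nfg branch: the names GraphemeTable, to_nfg and from_nfg are invented for illustration, the use of negative numbers for table entries is an assumption, and the clustering rule (attach every combining mark to the preceding code point) is much cruder than proper grapheme segmentation. The shape of the idea is the same, though: normalize to NFC first, keep single code points as they are, and give every leftover multi-code-point sequence an entry in a hashed table.

```python
import unicodedata

class GraphemeTable:
    """Hypothetical table mapping multi-code-point graphemes to synthetic ids."""

    def __init__(self):
        self._ids = {}       # tuple of code points -> synthetic id (hash lookup)
        self._clusters = []  # position -> tuple of code points

    def intern(self, cluster):
        """Return the id for a cluster, creating a new entry on first sight."""
        key = tuple(cluster)
        if key not in self._ids:
            self._clusters.append(key)
            # Assumption: synthetic ids are negative (-1, -2, ...) so they
            # can never collide with real code points.
            self._ids[key] = -len(self._clusters)
        return self._ids[key]

    def lookup(self, synthetic_id):
        """Return the code points behind a synthetic id."""
        return self._clusters[-synthetic_id - 1]


def to_nfg(text, table):
    """Turn text into one integer cell per grapheme, with NFC as a first step."""
    composed = unicodedata.normalize("NFC", text)
    clusters = []
    for ch in composed:
        if unicodedata.combining(ch) and clusters:
            clusters[-1].append(ord(ch))  # combining mark joins previous cluster
        else:
            clusters.append([ord(ch)])    # anything else starts a new cluster
    return [c[0] if len(c) == 1 else table.intern(c) for c in clusters]


def from_nfg(cells, table):
    """Expand cells back into an ordinary (NFC) string."""
    out = []
    for cell in cells:
        code_points = (cell,) if cell >= 0 else table.lookup(cell)
        out.extend(chr(cp) for cp in code_points)
    return "".join(out)


table = GraphemeTable()
cells = to_nfg("a\u0308n\u0308", table)
print(cells)  # [228, -1]
```

Here "ä" survives as its own precomposed code point (228, i.e. U+00E4), while "n̈" becomes the first table entry. Transcoding another string that contains "n̈" with the same table reuses that entry rather than adding a new one, which is what the hash lookup is for.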

Looking at my initial estimates for the project schedule, and assuming the table code can be cleaned up in under a week, I am basically on schedule and, with some luck, I might be able to get a bit ahead before the midterms. But that's enough optimism for one day; I still have coding to do.