The adventure begins

Hello, I'm Tyler Curtis (tcurtis on #parrot), and once again, I have the fortune to be doing Google Summer of Code for Parrot. This summer, I will be writing a LALR parser generator targetting Parrot. My proposal contains a tentative schedule and some amount of explanation of what my project involves. I plan to produce some further explanation in a future blog post, as well. In this one, I'll mostly talk about my current activities: reading papers about LALR parsing and preparing to actually start coding once I properly understand the theory.

Thus far I've been reading the paper which introduced LALR parsing, "Practical Translators for LR(k) Languages" by Franklin DeRemer. I've read the first four chapters. After some preliminaries on characteristic strings, canonical parses, and the definition of LR(k) grammars, DeRemer describes a series of equivalent algorithms for parsing LR(0) grammars (those grammars which can be parsed in a single scan of the input from left-to-right without any lookahead). The last of these algorithms uses a deterministic push-down automata. In Chapter 4, he expands these algorithms in turn to deal with SLR(k) grammars, which deal with ambiguities using lookahead sets that can be simply computed (the precise meaning of "simply" is not easy to concisely explain; simplicity is quite complicated). A particularly interesting property of these techniques for parsing SLR(k) grammars is that they involve no additional work compared to the corresponding techniques for LR(0) grammars when the grammar in question is indeed LR(0). I believe that a similar statement applies to the LALR(k) and general LR(k) parsing algorithms developed in Chapter 5, but having not yet read that chapter, I do not know for sure. After this, DeRemer describes a way of implementing such parsers and a parser generator for them using a table representation and an interpreter for these tables.

DeRemer's dissertation is really quite interesting, and I'm excited about reading the remaining few chapters; I'd certainly recommend it to anyone with an unhealthy fondness for parsing and some time on their hands (like me), even if they aren't going to get paid to implement it (unlike me). Actually, that last parenthetical is not quite true. I won't be implementing the exact parsing table construction technique from "Practical Translators for LR(k) Languages". After DeRemer's dissertation, some implementations were produced (including the famous YACC) and some other techniques for more efficiently calculating look-ahead sets were explored. DeRemer himself, collaborating with Thomas Pennello, developed an algorithm for calculating the look-ahead sets, published in 1982 as "Efficient Computation of LALR(1) Look-Ahead Sets", which is described as performing fewer than 15% of the number of set unions as the algorithm used by YACC on some grammars. After I finish reading DeRemer's dissertation, "Efficient Computation of LALR(1) Look-Ahead Sets" is next on my reading list. My current plan is to implement the algorithm it describes.

That's what I've been reading and plan to read, but in the first paragraph I promised you some discussion about my preparations for coding. Firstly, I set about setting up an environment for working on my project. I'm set up dedicated parrot, plumage, winxed, etc. installations for this project. I've also created a skeleton project which will eventually hold actual code/tests/documentation for the as-yet-unnamed project. I've also started painting some of the simpler bikesheds. I've chosen Winxed as both the implementation language and the target language for the parser generator. I've decided to use Rosella's testing tools for testing.

Unfortunately, one of the most important of bikesheds remains untouched: what will I name this thing? whiteknight++ and bubaflub++ provided me with some helpful suggestions on IRC: taflalrt(Tyler's Awesome Freaking LALR Thingy), LinewALkeR and whoLesALeR. But I don't think I'm going to use any of those; I don't feel like naming things after myself, and the latter two don't actually have any relation to LALR parsing other than containing LALR. I'd welcome more suggestions. Do you really want this to end up being named parrot-lalr? I will do it. Don't try me. I will. But maybe if someone provides some nice suggestions, either by email or on #parrot, the Parrot community will be spared this travesty of nomenclature.

That's all folks.