GSoC: LALR Parsing: School's out for summer.

I'm rather behind on my blogging schedule, and also somewhat behind on my coding schedule, thanks to finals and packing and moving out of the dorm, but there is good news: finals are over, and now I'm free to focus on Summer of Code. In this blog post, I'll describe the representation I've decided on for grammars and talk about my next steps.

Another piece of good news: I've actually decided on a name: LALRskate. dukeleto++ suggested it, and given that it means I could name some debugging tool lalrcat, I had no choice but to go with it.

I'm going to be dealing with grammars a lot in my parser generator, both creating them (for testing purposes) and manipulating them (in the parser generator itself), so it pays to have a representation that's convenient to work with. I think I've arrived at one, although there are a few implementation changes I'd like to make at some point. A grammar is typically defined as a tuple of a set of nonterminal symbols, a set of terminal symbols, a start symbol (which is one of the nonterminals), and a set of production rules. In a context-free grammar (of which LALR(k) grammars are a subset), each production rule maps a single nonterminal symbol to a sequence of nonterminal or terminal symbols. This definition is the basis for the representation I'm using.
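To make the formal definition concrete, here is a toy illustration (in Python, not LALRskate code) of that four-part tuple for a tiny expression grammar I made up for this example:

```python
# The four components of a context-free grammar, for a toy
# expression grammar. Symbols are plain strings, as in the post.
nonterminals = {"E", "T"}
terminals = {"+", "id"}
start = "E"
# Each production maps one nonterminal to a sequence of symbols.
productions = {
    "E": [["E", "+", "T"], ["T"]],
    "T": [["id"]],
}

# Sanity check: every left-hand side is a nonterminal, and every
# right-hand-side symbol is either a terminal or a nonterminal.
for lhs, rhss in productions.items():
    assert lhs in nonterminals
    for rhs in rhss:
        assert all(s in nonterminals or s in terminals for s in rhs)
```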

A LALR.Grammar object has a set of nonterminal symbols and a set of terminal symbols, each of which may be accessed as an immutable array using the .nonterminals() or .terminals() methods. (Technically, you can currently modify the returned array, but the changes won't be propagated back to the grammar; making attempted modifications throw exceptions is one of the planned implementation changes mentioned above.) Symbols are represented as strings. The generated parsers will probably end up using a representation with faster comparisons, but from the grammar's perspective, strings are fine. You can check whether a particular symbol is a nonterminal or a terminal within a grammar with the .is_nonterminal(symbol) or .is_terminal(symbol) methods; to check whether it is either one, use .is_symbol(symbol). Nonterminal and terminal symbols are added with the .add_nonterminal(symbol) and .add_terminal(symbol) methods.
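A hypothetical Python analogue of that symbol API (this is a sketch of the interface described above, not LALRskate's actual code) might look like this; returning tuples gives the "immutable array" behaviour, since callers can't mutate them:

```python
# Sketch of the LALR.Grammar symbol API; names mirror the post.
class Grammar:
    def __init__(self):
        self._nonterminals = []
        self._terminals = []

    def add_nonterminal(self, symbol):
        self._nonterminals.append(symbol)

    def add_terminal(self, symbol):
        self._terminals.append(symbol)

    def nonterminals(self):
        return tuple(self._nonterminals)   # immutable view

    def terminals(self):
        return tuple(self._terminals)

    def is_nonterminal(self, symbol):
        return symbol in self._nonterminals

    def is_terminal(self, symbol):
        return symbol in self._terminals

    def is_symbol(self, symbol):
        return self.is_nonterminal(symbol) or self.is_terminal(symbol)

g = Grammar()
g.add_nonterminal("E")
g.add_terminal("id")
```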

A LALR.Grammar has a start symbol, which can be set or queried using the .start(symbol) or .start() methods. This symbol must be in the grammar's set of nonterminals; otherwise an exception is thrown.
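The combined getter/setter with its validation could be sketched like so (again a hypothetical Python rendering of the described behaviour, not the real implementation):

```python
# Sketch of the .start() accessor: query with no argument,
# set with one, and reject symbols outside the nonterminal set.
class Grammar:
    def __init__(self, nonterminals):
        self._nonterminals = set(nonterminals)
        self._start = None

    def start(self, symbol=None):
        if symbol is None:
            return self._start           # query form: .start()
        if symbol not in self._nonterminals:
            raise ValueError("start symbol must be a nonterminal")
        self._start = symbol             # setter form: .start(symbol)

g = Grammar(["E", "T"])
g.start("E")
```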

Finally, a grammar has a set of production rules. Since, for the grammars we are interested in, each production has a single nonterminal symbol as its left-hand side, the productions are only accessible through the relevant left-hand-side nonterminal symbol. Specifically, the .rule(symbol) method returns an array of arrays, where each inner array represents the right-hand side of a single production rule with symbol as the left-hand side. Currently, modifying either the outer array or the inner arrays will change the grammar, but this should be considered a bug, and will eventually be fixed by making attempts to modify the result of .rule throw exceptions.
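One way to sketch that rule storage in Python (hypothetical, not LALRskate code) is to have .rule() return copies at both levels, so mutations can't propagate back; that's the same end result as the planned exception-throwing fix, achieved differently:

```python
# Sketch of production-rule storage keyed by left-hand side.
class Grammar:
    def __init__(self):
        self._rules = {}   # nonterminal -> list of right-hand sides

    def add_rule(self, lhs, rhs):
        self._rules.setdefault(lhs, []).append(list(rhs))

    def rule(self, symbol):
        # Copy both the outer and inner lists so callers
        # can't mutate the grammar through the result.
        return [list(rhs) for rhs in self._rules.get(symbol, [])]

g = Grammar()
g.add_rule("E", ["E", "+", "T"])
g.add_rule("E", ["T"])
rules = g.rule("E")
rules[0].append("bogus")   # mutating the copy leaves g untouched
```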

I've written code to build a few example grammars and a pretty-printing module that displays a grammar in a syntax partially inspired by BNF and partially inspired by yacc. The grammar representation seems to be convenient enough in both cases. Now, I need to get to work on actually generating parsers.
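For a sense of what "partially BNF, partially yacc" output might look like, here is a guess at such a pretty-printer in Python; the format and the pretty function are my own invention, and the actual LALRskate output may well differ:

```python
# A yacc-flavoured pretty-printer for the dict-of-lists rule
# representation used in the earlier sketches.
def pretty(rules, start):
    lines = []
    # List the start symbol's rules first, yacc-style.
    order = [start] + [n for n in rules if n != start]
    for lhs in order:
        alts = [" ".join(rhs) if rhs else "/* empty */"
                for rhs in rules[lhs]]
        lines.append(lhs + ": " + "\n    | ".join(alts) + "\n    ;")
    return "\n".join(lines)

rules = {"E": [["E", "+", "T"], ["T"]], "T": [["id"]]}
print(pretty(rules, "E"))
```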

A helpful aspect of the approach to parsing LALR(k) (or even general LR(k)) grammars described in "Practical Translators for LR(k) Languages" is that it is just a generalization of the same paper's approach to parsing simpler grammars (for example, LR(0) grammars). So my immediate goal is to generate the parsing tables for LR(0) grammars. Section 7.2 of the paper describes a process for generating parsers; for LR(0) grammars, this process involves building a CFSM ("characteristic finite-state machine") from the grammar's characteristic strings, verifying that the CFSM has no inadequate states, replacing #-transitions with transitions to look-back states, and applying various optimizations. Tomorrow, I intend to write a class to represent the CFSM (and the deterministic pushdown automaton that it will be transformed into). Tomorrow and Thursday, I will write the code to generate the CFSM, and on Friday and Saturday I plan to implement the look-back bits.
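The CFSM's states correspond to sets of LR(0) items, built by the standard closure/goto construction; here is a compact Python sketch of that construction on a tiny augmented grammar of my own choosing (illustrative only, and simpler than what LALRskate will need):

```python
# Build the LR(0) item sets (CFSM states) for a toy grammar.
# An item is a triple (lhs, rhs, dot) with dot marking position.
rules = {"S'": [["E"]], "E": [["E", "+", "id"], ["id"]]}

def closure(items):
    # Repeatedly add items for nonterminals appearing after a dot.
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in rules:
                for prod in rules[rhs[dot]]:
                    item = (rhs[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, symbol):
    # Advance the dot over `symbol`, then close the result.
    return closure((lhs, rhs, dot + 1)
                   for lhs, rhs, dot in items
                   if dot < len(rhs) and rhs[dot] == symbol)

# Explore all states reachable from the augmented start item.
start = closure({("S'", ("E",), 0)})
states, work = {start}, [start]
while work:
    st = work.pop()
    symbols = {rhs[dot] for _, rhs, dot in st if dot < len(rhs)}
    for sym in symbols:
        nxt = goto(st, sym)
        if nxt not in states:
            states.add(nxt)
            work.append(nxt)
```

For this particular grammar the construction yields five states; a state containing both a completed item and a shiftable item would be one of the "inadequate states" the LR(0) check has to rule out.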

After that, I face the question of whether to stop at the parsing table or to go ahead and build something that will let me test that the DPDA is actually parsing the grammars correctly. I suspect that writing a simple interpreter for the DPDAs wouldn't take more than two or three days, and it would be helpful until I actually implement code generation. I plan to ask on #parrot and parrot-dev for opinions on this, but I'm currently leaning towards writing the DPDA interpreter.
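To show how little machinery such an interpreter needs, here is a rough Python sketch of a table-driven shift/reduce loop, with hand-written tables for the trivial grammar E -> id; the table encoding and all names are mine, purely for illustration:

```python
# action[state][token] is ("shift", next_state),
# ("reduce", lhs, n) with n symbols to pop, or ("accept",).
action = {
    0: {"id": ("shift", 1)},
    1: {"$": ("reduce", "E", 1)},
    2: {"$": ("accept",)},
}
goto = {0: {"E": 2}}

def parse(tokens):
    stack = [0]                  # stack of states
    tokens = list(tokens) + ["$"]
    pos = 0
    while True:
        act = action[stack[-1]].get(tokens[pos])
        if act is None:
            return False         # parse error
        if act[0] == "accept":
            return True
        if act[0] == "shift":
            stack.append(act[1])
            pos += 1
        else:                    # reduce: pop, then take goto
            _, lhs, n = act
            del stack[len(stack) - n:]
            stack.append(goto[stack[-1]][lhs])
```

The real interpreter would need the stack to carry semantic values as well, but the control loop itself stays about this simple.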

That's all for now. Happy hacking, folks.