GSoC: Wrapping up, and some documentation

The hard "pencils down" date was Monday, so now seems like a good time for a blog post summarizing what I ended up completing.

I have DPDA generation and parsing working for LR(0) and SLR(1) grammars. I have the beginnings of a grammar specification DSL (a grammar, but no actions or tokenizer yet; it's in the dsl branch). I do not have support for LALR(k) grammars or general LR(k) grammars. I have not implemented generating code to parse grammars (as opposed to interpreting the DPDA in order to parse them).

I am still planning to implement support for a wider class of grammars in the future, as well as code generation, but I can't guarantee when I'll have the tuits.

That is what my project can currently do, but it's currently lacking much documentation for how exactly to make it do what it can do, so I'll describe that now (and later add such documentation to it, partially derived from this post).

There are 5 main interfaces of interest to an end-user:

LALR.Grammar
This class represents a context-free grammar.
LALR.Generator
This class produces a DPDA from a grammar using its generate(grammar) method.
LALR.DPDA
This class represents a DPDA for parsing a grammar.
LALR.DPDA.Interpreter.parse(dpda, input, actions[named, optional])
This function parses an array of tokens using the supplied DPDA.
LALR.Match.Nonterminal/LALR.Match.Terminal
A parse tree formed of these objects is the result of LALR.DPDA.Interpreter.parse().
Tokens
The parsers are intended to parse a sequence of tokens, and make some assumptions about the format of these tokens.
Actions classes
These are user-supplied classes that can be supplied to LALR.DPDA.Interpreter.parse() in order to run semantic actions during the parsing process.

Now, I'll describe each in greater detail.

LALR.Grammar

A grammar has a start symbol, a set of nonterminal symbols, a set of terminal symbols, and a set of production rules (which map nonterminal symbols to a sequence of symbols). The start symbol must be a nonterminal symbol, and each symbol in the right-hand-side of a production rule must be either a terminal or nonterminal symbol. Violation of these constraints will throw an exception in the relevant mutator methods.

A freshly-created LALR.Grammar has no terminals, nonterminals, or rules, and does not have a specified start symbol.

One can add nonterminal and terminal symbols with the .add_terminal(name) and .add_nonterminal(name), methods respectively. The .is_terminal(name) and .is_nonterminal(name) test whether the symbols are terminal or nonterminal symbols, respectively. The .is_symbol(name) method returns a true value if the name is either a terminal symbol or a nonterminal symbol. An array listing all of the nonterminal/terminals in the grammar can be retrieved using the .nonterminals()/.terminals() methods.

The .start() method can be used with an argument to set the start symbol, which must be a nonterminal, or without one to retrieve the currently set start symbol.

The .rule(nonterm) method returns a list of the production rules for a given nonterminal. Each production rule is represented as an array containing the symbols in its right-hand side. Do not modify this array or any of the sub-arrays. Doing so may have predictable results for now, but it will not in the future. The .add_rule(lhs, ...) variadic method adds a production rule with the nonterminal lhs as the left-hand-side and the remainder of the arguments forming the right-hand-side.

For grammar debugging, one can use the .as_bnf_string() to get a string in a yacc-like/BNF-like syntax describing the grammar.

LALR.Generator

This one is simple. Create a new LALR.Generator and call the .generate(grammar) method with a grammar. The .generate method will return the generated DPDA. Each LALR.Generator object will only be willing to generate a single DPDA. Attempts to re-use them will throw.

LALR.DPDA

If you just want to parse according to a grammar, all that matters about this class is that LALR.Generator.generate() returns one and that LALR.DPDA.Interpreter.parse() takes one as its argument.

LALR.DPDA.Interpreter.parse()

This function is called as LALR.DPDA.Interpreter.parse(dpda, input) where dpda is a LALR.DPDA, and input is an array of tokens representing the string to be parsed. The result is a tree of LALR.Match.Nonterminal and LALR.Match.Terminal objects representing the parsed string.

LALR.Match.Nonterminal/Terminal

Both of these classes have an ast attribute, which can be queried or set using the .ast() method. This can be used for building up an AST during the parse phase.

LALR.Match.Nonterminal objects represent parsing a nonterminal. They have a name and a number attribute (with accessor methods) which hold the name and number (0-based, within the other productions of the same name) of the production rule which was parsed. They also have a children attribute which is an array containing match objects for each symbol on the right-hand-side of the production rule.

LALR.Match.Terminal objects represent parsing a terminal symbol. They have a terminal attribute which holds the terminal which was matched.

Tokens

Tokens are compared to the terminal strings in the grammar. Other than comparing appropriately to these strings, no requirements are made on their behavior. They will be passed through the the terminal attribute of LALR.Match.Nonterminal as-is.

Actions classes

Actions objects provide functionality similar to semantic actions in other LALR parser generators.

When a terminal is read while parsing, if an actions object has been specified, the .terminal_action() method of the object will be called with the terminal and the match object. When a nonterminal is parsed, if an actions object has specified, the .nonterminal_action() method of the object will be called with the production name, the production number, and the match object.

This blog post should hopefully provide sufficient information to actually be able to write parsers using lalrskate, but if anyone has any questions or comments, feel free to contact me either by email (tyler.l.curtis@gmail.com) or on IRC (tcurtis at #parrot), though the former is more certain to reach me in a timely manner.