SOD: Symbolic Opcode Description

Well, GSoC is starting to wind down. I can't believe it's almost over. It feels like the "pencils down" date just jumped up out of nowhere. I had a lot more planned for HBDB, but several flaws in Parrot's design make even some of the most basic debugging tasks very difficult, as I'll explain in a moment.

Over the past few months, I've learned a great deal about debugger design. There's not a lot of literature out there on the subject, so much of what I've learned comes from tinkering with the GDB source code, reading debug format specifications, and just being nosy and curious in general. What I've realized, however, is that Parrot's design does not make traditional symbolic debugging easy at all. The main design flaw is that we do not use a real debug segment in Parrot bytecode. The existing segment contains only a very inaccurate line-number-to-opcode mapping that is just about useless. Fortunately, I plan on fixing this (and so do you, you just don't know it yet).

Before I begin ranting, I suggest you read my recent message to parrot-dev. Much of what I will be referring to is already mentioned there.

A symbolic debugger needs high-level information about the original source code (variables, data types, subroutines, classes, and so on) so that it can relate the low-level opcodes to the high-level statements that generated them. Parrot bytecode does not preserve any of this information, so all a debugger can really do is perform low-level tasks without knowing anything about the source code behind the opcodes. What Parrot needs is a generic debug data format for storing high-level source information in the debug segment.

I've taken it upon myself to begin writing up a specification. It can be found here. For now, I'm calling it SOD: Symbolic Opcode Description format.

The format itself is block-structured, forming a tree in which every entity is "owned" by another entity. This makes it very easy to describe the static structure of a source file, since the code's intermediate representation, the Abstract Syntax Tree, already has a similar shape. Only the minimal amount of information needed to describe a program object is stored.

The most basic entity in SOD is called a "Data Description Entity" or DDE. It consists of a "class" that indicates what it describes and a series of "properties" that describe the specific characteristics of the entity. For example, CLASS_enum_type is a class and PT_name is a property.
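To make the class-plus-properties idea a bit more concrete, here is a minimal C sketch of how a DDE might be modeled in memory. The tag names CLASS_enum_type, PT_sibling, PT_name, PT_byte_size, and PT_elem_list come from this post; everything else (the sod_ prefix, the struct layouts, the numeric tag values, and sod_find_prop) is hypothetical and not taken from the specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Tag names follow the post; the numeric values are invented for this sketch. */
typedef enum { CLASS_enum_type = 1 } sod_class;

typedef enum {
    PT_sibling = 1,
    PT_name,
    PT_byte_size,
    PT_elem_list
} sod_prop_type;

/* One property: a tag plus either a number or a string (hypothetical layout). */
typedef struct {
    sod_prop_type type;
    int           is_string;   /* nonzero if value.str is the active member */
    union {
        uint32_t    num;
        const char *str;
    } value;
} sod_prop;

/* A DDE: the class says what is described, the properties say how. */
typedef struct {
    sod_class       klass;
    size_t          nprops;
    const sod_prop *props;
} sod_dde;

/* Return the first property with the given tag, or NULL if the DDE lacks it. */
const sod_prop *sod_find_prop(const sod_dde *dde, sod_prop_type type)
{
    assert(dde != NULL);
    for (size_t i = 0; i < dde->nprops; i++) {
        if (dde->props[i].type == type)
            return &dde->props[i];
    }
    return NULL;
}
```

With something like this, a debugger front end could ask any DDE for its PT_name without caring which class it belongs to, which keeps the lookup code generic.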

Consider the following enumeration:

enum e { A, B, C};

This would generate the following DDE (the leading hexadecimal number on each entry is its address, and the numbers inside <> are sizes in bytes):

00ad:  <46>  CLASS_enum_type
         PT_sibling(0xdb)  # Next DDE owned by my parent
         PT_name("e")
         PT_byte_size(0x4)
         PT_elem_list(<18>(2="C") (1="B") (0="A"))

00db:  <4>  # Null entry, end of sibling chain
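One detail worth noticing in the dump: the entry at 0x00ad is 46 (0x2e) bytes long, and 0xad + 0x2e = 0xdb. Since this entry owns no children, its sibling starts right after it, which is exactly where PT_sibling points. The C sketch below walks such a sibling chain over a small table of entry headers. The dde_header struct, the use of a NULL class name to mark the null entry, and the functions find_entry and count_siblings are all assumptions made for illustration, not part of SOD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one DDE; field names are invented for this sketch. */
typedef struct {
    uint32_t    addr;     /* address of the entry, e.g. 0x00ad */
    uint32_t    size;     /* size in bytes, the number shown inside <> */
    uint32_t    sibling;  /* target of PT_sibling; unused for a null entry */
    const char *klass;    /* class name, or NULL for a null entry */
} dde_header;

/* Linear lookup by address; returns NULL if no entry starts there. */
static const dde_header *find_entry(const dde_header *tbl, size_t n, uint32_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (tbl[i].addr == addr)
            return &tbl[i];
    }
    return NULL;
}

/* Count the real entries on the chain starting at `start`.
   Stops at the null entry, at a dangling address, or after n hops
   (the hop limit guards against a malformed, cyclic chain). */
size_t count_siblings(const dde_header *tbl, size_t n, uint32_t start)
{
    size_t   count = 0;
    uint32_t addr  = start;

    assert(tbl != NULL || n == 0);
    for (size_t hops = 0; hops < n; hops++) {
        const dde_header *e = find_entry(tbl, n, addr);
        if (e == NULL || e->klass == NULL)
            break;
        count++;
        addr = e->sibling;
    }
    return count;
}
```

For the two entries in the dump, a walk starting at 0x00ad would visit one real entry (the enum) and then stop at the null entry at 0x00db.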

This is a very simplified example. I plan on including a much better one in the specification though. Go read it! Now!

I'm pretty much done picking out the bits and pieces of DWARF that I like. The next step is to have people review it and design an actual implementation.

This is obviously a very large task, and I definitely don't consider it part of my GSoC project. It's not for me or GSoC; it's for Parrot. This is going to be a full-fledged effort by anyone willing to help. Without it, it's just about impossible to implement any more real features in HBDB, so I foresee HBDB going on hiatus while SOD is being written. Once it's done, we're going to have a really awesome debugger, along with many opportunities for other analysis tools.