How to convert PIR to C

OBSOLETE: Needs to be updated post-PDD17

When writing a compiler for Parrot, it is actually very easy to implement it in PIR. Aside from the lack of high-level control flow constructs, PIR has a lot of benefits:

  • it makes string operations very easy
  • it handles the Parrot calling conventions for you
  • ...

Each opcode invocation, however, has overhead. So after benchmarking how much time is spent in a particular PIR .sub, you may find it worthwhile, for speed, to replace that sub with one written in C.

Step 1:

Decide where you want the C code to live:

  • In a dynamic opcode?
  • As a class or instance method on a PMC? (Or, if the original code was a class, as an entire PMC)
  • In a C library? (You may wish to choose this even if the user-facing method is a dynamic opcode: simply dispatch to your C function from there.)

If this code is tied to a particular type of data (say, escaping a string), consider adding it to the PMC.

We'll assume PMC for now, and that you're adding this to an existing PMC. (Starting a PMC from scratch is another tutorial)

Step 2:

Now, determine whether your method needs to follow the Parrot calling conventions or the simpler C-style conventions. If you're returning void or a single value, and are taking a fixed number of arguments, you can use the C-style conventions. Any optional, named, slurpy, or otherwise special arguments warrant using the Parrot calling conventions.

The C code for each belongs in the .pmc file, e.g. src/pmc/codestring.pmc

The shell for a C-style method (that takes a single PMC arg and returns a single PMC result) is:


METHOD PMC* method_name(PMC* source) {
   ...
}

The shell for a PCC-style method (that takes a string, and slurps up the remaining arguments into an array automatically) is:


PCCMETHOD void method_name(STRING* fmt, PMC* args: slurpy) {
    PMC* result;
    ...
    PCCRETURN(PMC* result)
}

Note that at the C level, this method has a void return signature: the PMC to C compiler expands the not-quite-macros and argument declarations here into the required C code to implement the calling conventions.

Step 3:

Now to actually write some code. How do you convert PIR into C? We've looked at the equivalent of the .sub/.end directives and the .return directive so far.

Here's a simple sub that takes an input string and returns a string of the same length that preserves case and whitespace, but obscures the content. (thanks to tlily for the inspiration)


.include 'cclass.pasm'

.sub 'murfle'
    .param string original

    .local string result
    result = ""
    .local string replacement
    .local int len
    replacement = 'blah'
    len = 4 # length of replacement

    .local int orig_len
    orig_len = length original

    .local int orig_pos, pos
    orig_pos = 0
    pos = 0
  loop:
    if orig_pos >= orig_len goto end_loop

    $S1 = substr original, orig_pos, 1

    $I0 = is_cclass .CCLASS_UPPERCASE, $S1, 0
    if $I0 goto handle_uppercase

    $I0 = is_cclass .CCLASS_LOWERCASE, $S1, 0
    if $I0 goto handle_lowercase

    goto handle_non_alpha

  handle_lowercase:
    $S2 = substr replacement, pos, 1
    $S2 = downcase $S2
    goto loop_continue

  handle_uppercase:
    $S2 = substr replacement, pos, 1
    $S2 = upcase $S2
    goto loop_continue

  handle_non_alpha:
    result .= $S1
    inc orig_pos
    pos = 0 # Start replacement text over.
    goto loop
    
  loop_continue:
    result .= $S2
    inc orig_pos
    inc pos
    pos = pos % len
    goto loop
  end_loop:
    
    .return(result)
.end

The first part of our conversion will be a simple, line-for-line translation from PIR into C. Wherever there is an opcode, you can look in src/ops/*.ops to peek under the hood and see what C-level routines are called. You'll notice that in C, we have to explicitly pass around an interpreter, which is usually hidden from us in PIR. We don't have to convert any of the flow control in the first step, since C supports labels and gotos. As long as our math is with simple int/num registers, no real conversion is needed there either. The one real change here is that we've combined our declarations and initializations, something that we are unable to do in PIR.

First, we need to include a header file for our cclass constants, just as we did in PIR.


    #include "parrot/cclass.h"

    ...

Then, we map out the method signature: we're going to be taking in a string and returning a string. We also declare and initialize all the variables we're using. Note that we're even declaring the temporary variables that we used in the PIR ($I0, etc.).

We're using two of the four basic register types here, string (`STRING *`), and int (`INTVAL`). There are also PMC (`PMC *`), and numeric (`FLOATVAL`).

The string_from_literal method is used to create a parrot-style string from a literal C string. Notice we're passing in INTERP, even though that's not defined anywhere: when this PMC is converted to the actual C that will be compiled, murfle will accept an INTERP parameter that will be passed along whenever the method is invoked, and we in turn pass it along to almost every method we call.

We'll discuss string_length in a moment...


    METHOD STRING* murfle(STRING* original) {
        STRING* result      = string_from_literal(INTERP, "");
        STRING* replacement = string_from_literal(INTERP, "blah");
        INTVAL  len         = 4; /* length of replacement */
        INTVAL  orig_len    = string_length(INTERP, original);
        INTVAL  orig_pos    = 0;
        INTVAL  pos         = 0;
        STRING* S1;
        STRING* S2;
        INTVAL  I0;

Now, we convert the initial portion of the loop. This is a line-for-line translation: the only real change aside from C syntax is that instead of using opcodes for is_cclass and substr, we've replaced those with calls to the actual C functions. How do you know which C to call? Simply look in the opcode definition, and translate the placeholders $1, $2 there into the actual variables.

This is where we got the definition for string_length above - we just pulled it from the length opcode.


      loop:
        if (orig_pos >= orig_len) goto end_loop;
        S1 = string_substr(INTERP, original, orig_pos, 1, &S1, 0);

        I0 = Parrot_string_is_cclass(INTERP, enum_cclass_uppercase, S1, 0);
        if (I0) goto handle_uppercase;

        I0 = Parrot_string_is_cclass(INTERP, enum_cclass_lowercase, S1, 0);
        if (I0) goto handle_lowercase;

        goto handle_non_alpha;

Next are the conversions to upper or lowercase: again, we pull directly from the opcode definitions for substr, downcase and upcase, and leave the primitive flow control.


      handle_lowercase:
        S2 = string_substr(INTERP, replacement, pos, 1, &S2, 0);
        S2 = string_downcase(INTERP, S2);
        goto loop_continue;

      handle_uppercase:
        S2 = string_substr(INTERP, replacement, pos, 1, &S2, 0);
        S2 = string_upcase(INTERP, S2);
        goto loop_continue;

Here we have a trivial syntax conversion for the comments, but some slightly more interesting code. The original PIR begins with 'result .= $S1', which has no visible opcode. How do you determine what C to call in that case? The easiest way is to add a 'trace 1' opcode just before the PIR code you're translating: this will show which opcodes are being called when the PIR is executed. In this case, it's an append opcode, which we can then translate to C.

The math here is on basic int types, and is trivial enough that I just translate it without finding the opcode: inc becomes a post-increment, and the assignment and mod operators work as expected.


      handle_non_alpha:
        result = string_append(INTERP, result, S1);
        orig_pos++;
        pos = 0; /* Start replacement text over. */
        goto loop;

      loop_continue:
        result = string_append(INTERP, result, S2);
        orig_pos++;
        pos++;
        pos = pos % len; 
        goto loop;
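
The wrap-around arithmetic in loop_continue can be tried on its own. This is a minimal sketch in plain C, not Parrot code: next_pos is a hypothetical helper, and long stands in for INTVAL as an assumption. One thing worth knowing is that C's % truncates toward zero, so it only behaves like a wrap-around for non-negative operands, which is all murfle ever feeds it.

```c
#include <assert.h>

/* Hypothetical helper, for illustration only.
 * Mirrors 'inc pos' followed by 'pos = pos % len' from the PIR. */
long next_pos(long pos, long len)
{
    assert(len > 0);               /* modulo by zero is undefined in C */
    assert(pos >= 0 && pos < len); /* keep % in its non-negative range */
    pos++;
    pos = pos % len;
    return pos;
}
```

With a replacement of length 4, next_pos(3, 4) wraps back to 0.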

Finally, we return the string result we've built up using the standard C return. If this were a PCCMETHOD, we'd need to use the PCCRETURN macro.


      end_loop:
        return(result);
    }

Step 4:

We can now update the code that called this PIR sub and run all our tests (you have tests, right?) to make sure we didn't break any functionality.

This will change from:


  $S2 = murfle($S1)

to a method call on an instance of whatever PMC we attached this to:


  $P1 = new 'MooString'
  $S2 = $P1.'murfle'($S1)
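
If you want something to compare test output against, the algorithm can be modeled without Parrot at all. The following is a sketch under stated assumptions, not the PMC method itself: murfle_model is a hypothetical name, plain C strings stand in for STRING*, and <ctype.h> stands in for is_cclass, so it only handles single-byte characters in the C locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Standalone model of murfle (hypothetical, not Parrot code).
 * Returns a newly allocated string the caller must free, or NULL
 * if allocation fails. */
char *murfle_model(const char *original)
{
    static const char replacement[] = "blah";
    const size_t len = sizeof replacement - 1; /* length of replacement */
    size_t orig_len;
    size_t pos = 0;
    size_t i;
    char *result;

    assert(original != NULL);
    orig_len = strlen(original);

    result = malloc(orig_len + 1);
    if (result == NULL)
        return NULL;

    for (i = 0; i < orig_len; i++) {
        const unsigned char c = (unsigned char)original[i];

        if (isupper(c)) {
            result[i] = (char)toupper((unsigned char)replacement[pos]);
            pos = (pos + 1) % len;
        }
        else if (islower(c)) {
            result[i] = (char)tolower((unsigned char)replacement[pos]);
            pos = (pos + 1) % len;
        }
        else {
            result[i] = (char)c;
            pos = 0; /* Start replacement text over. */
        }
    }
    result[orig_len] = '\0';
    return result;
}
```

For example, murfle_model("Hello, World!") yields "Blahb, Blahb!": case and punctuation survive, and the replacement restarts after each non-alphabetic character.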

Step 5:

Finally, we can clean up our C code to make it more idiomatic, straightforward, and readable.

Let's step through the first pass code and point out some cleanups:


      loop:
        if (orig_pos >= orig_len) goto end_loop;

        ...
        goto loop;
      end_loop:

That's a while loop. We just can't write those in PIR, so we were stuck with the more verbose syntax. Once the loop is a while, any 'goto loop' or 'goto end_loop' in the body can be rewritten as continue or break, respectively.
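
To see the rewrite in isolation, here is a small sketch in plain C with nothing Parrot-specific in it. The two functions are hypothetical examples: they count spaces in a C string, once in the label/goto shape a first-pass translation produces and once as a while loop. Note that the exit test flips sense: "if done, goto end_loop" becomes "while not done".

```c
#include <assert.h>
#include <stddef.h>

/* First-pass shape: labels and gotos, as a PIR translation would give us. */
size_t count_spaces_goto(const char *s)
{
    size_t pos   = 0;
    size_t count = 0;

    assert(s != NULL);
  loop:
    if (s[pos] == '\0') goto end_loop;
    if (s[pos] != ' ') {
        pos++;
        goto loop;      /* will become 'continue' */
    }
    count++;
    pos++;
    goto loop;          /* falls out naturally at the end of a while body */
  end_loop:
    return count;
}

/* Cleaned-up shape: the same logic as a while loop. */
size_t count_spaces_while(const char *s)
{
    size_t pos   = 0;
    size_t count = 0;

    assert(s != NULL);
    while (s[pos] != '\0') {
        if (s[pos] != ' ') {
            pos++;
            continue;
        }
        count++;
        pos++;
    }
    return count;
}
```

Both versions return the same count for any input, which is the property to check before and after a cleanup like this.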


        I0 = Parrot_string_is_cclass(INTERP, enum_cclass_lowercase, S1, 0);
        if (I0) goto handle_lowercase;

We don't use I0 anywhere after that conditional. The only reason it was broken out in PIR is the limited syntactic sugar available for if: we don't have that limitation in C, so this can be combined into a single statement. Also, there are two of these ifs back to back whose conditions are mutually exclusive: the second should really be an `else if`.
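
The same cleanup can be shown on a toy example. This is a sketch in plain C, with <ctype.h> standing in for Parrot_string_is_cclass; the enum and function names are hypothetical. The first function keeps the I0 temporary and the jumps, the second folds each test into its conditional and chains them.

```c
#include <assert.h>
#include <ctype.h>

enum char_kind { KIND_UPPER, KIND_LOWER, KIND_OTHER };

/* First-pass shape: a temporary holds each test result, then we jump. */
enum char_kind classify_goto(char c)
{
    int I0;

    I0 = isupper((unsigned char)c);
    if (I0) goto handle_uppercase;

    I0 = islower((unsigned char)c);
    if (I0) goto handle_lowercase;

    goto handle_non_alpha;

  handle_uppercase:
    return KIND_UPPER;
  handle_lowercase:
    return KIND_LOWER;
  handle_non_alpha:
    return KIND_OTHER;
}

/* Cleaned-up shape: no temporary, no labels, one if / else if chain. */
enum char_kind classify_chain(char c)
{
    if (isupper((unsigned char)c))
        return KIND_UPPER;
    else if (islower((unsigned char)c))
        return KIND_LOWER;
    else
        return KIND_OTHER;
}

/* Check that both forms agree on every 7-bit ASCII character. */
void classify_self_check(void)
{
    int c;

    for (c = 0; c < 128; c++)
        assert(classify_goto((char)c) == classify_chain((char)c));
}
```

Running classify_self_check() after the rewrite is a cheap way to confirm the two shapes still agree.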


      handle_lowercase:
        S2 = string_substr(INTERP, replacement, pos, 1, &S2, 0);
        S2 = string_downcase(INTERP, S2);
        goto loop_continue;

The only place this label is referred to is in the conditional above. The if/goto can be replaced by the more natural if/block.

Additionally, we can combine these two method calls into a single C statement.

Finally, we still have two variables, S1 & S2, that should be renamed to something a little more maintainable.

With these simplifications, let's try another pass at our C:


  METHOD STRING* murfle(STRING* original)
  { 
    STRING* result      = string_from_literal(INTERP, "");
    STRING* replacement = string_from_literal(INTERP, "blah");
    INTVAL  len         = 4; /* length of replacement */
    INTVAL  orig_len    = string_length(INTERP, original);
    INTVAL  orig_pos    = 0;
    INTVAL  pos         = 0;
    STRING* repl_char;
    STRING* orig_char;
 
    while (orig_pos < orig_len)
    { 
      orig_char = string_substr(INTERP, original, orig_pos, 1, &orig_char, 0);
      if (Parrot_string_is_cclass(INTERP, enum_cclass_uppercase, orig_char, 0))  {
        repl_char = string_upcase(INTERP,
            string_substr(INTERP, replacement, pos, 1, &repl_char, 0)
        );
        pos++;
        pos = pos % len; 
      } 
      else if (Parrot_string_is_cclass(INTERP, enum_cclass_lowercase, orig_char, 0)) {
        repl_char = string_downcase(INTERP,
            string_substr(INTERP, replacement, pos, 1, &repl_char, 0)
        );
        pos++;
        pos = pos % len; 
      } else {  
        repl_char = orig_char;
        pos = 0; /* Start replacement text over. */
      } 
      result = string_append(INTERP, result, repl_char);  
      orig_pos++;
    } 
    return(result);
  }

Enjoy.
