Semantic Action Coding

This page last updated: Sunday September 27, 1998 01:07

Overview

Given a translation scheme that contains pseudocode semantic actions, we explore how to insert the code into the recursive descent predictive parsing functions that recognize the grammar.

The Parser of the interpreter we are building uses a semantic value stack. In the examples that follow, the stack is a stack of Item structures. An Item structure contains a type and some data, some of which may be part of a C Language union. The type field tells us what is currently stored in the Item. The Item structure is typedef'd to be identical to the Token structure returned from the Scanner.

Using the Token structure as the stack data structure

Items pushed on the value stack may be of various types, including strings, integers, real numbers, etc., depending on the requirements of the grammar. Thus, the elements pushed on the stack are not simple numbers, they are structures containing both a type field and a C Language union of different types. The type field remembers what type of data was stored in the union, so that when we pop the item we know what type it is.

The requirements for a basic value stack element are quite similar to those for a Token. In both, we need to keep track of some data and the type of the data. For convenience, we make the stack Item structure identical to the Token structure returned from the Scanner:
typedef Token Item;

This definition lets us use all the TokenType constants to indicate the type of item pushed on the stack. Tokens can be assigned directly to Items, and vice-versa. We can even keep track of the line numbers of intermediate expressions that are pused on the stack, using the lineno field of the structure:

/* The Token returned by the Scanner.
 * A token is a classified lexeme.  Here, you find the lexeme
 * and its classification, and the line number on which it was found.
 * The union holds the converted form of the lexeme for those tokens
 * that need to be converted (e.g. to integers).
 */
typedef struct Token {
        TokenType type; /* type of the lexeme recognized */
        char *lexeme;   /* actual lexeme character string */
        long lineno;    /* line number on which lexeme was found */
        union {
            long uint;  /* for holding unsigned integers */
        } u;
} Token;

Thus, token.lexeme or item.lexeme refers to the string element inside the structure; token.u.uint or item.u.uint refers to the integer in the structure.

To copy a Token to an Item, you can use simple structure assignment, since they are the same basic type:
Token t = scanner();
Item i;

i = t;   /* this copies a Token structure to an Item structure */

The semantic action examples in this file use this Item structure as the basic data type for push and pop.

The stack routines know nothing about the contents of the structure. The stack is only responsible for pushing and popping copies of the structures.

Managing dynamic memory

You must be careful about who "owns" dynamic memory among the Scanner, Parser, and Stack:

The Scanner (lexical analyser) manages its own memory. The Scanner initialization function allocates any buffers needed; the Scanner reallocates space to handle any size input; the Scanner termination function frees the buffers. Via the Token structure lexeme tag, the Scanner returns a pointer to its own dynamic memory used to hold the lexeme. The Scanner will re-use this same memory, growing it as needed, to hold each new token. Nothing that calls the Scanner can rely on the lexeme pointer always pointing to the same lexeme; the Scanner will keep re-using, re-allocating, and overwriting the lexeme buffer. Anything that wants to save the lexeme string must copy it.

The Parser manages its own memory. When it needs to keep a string in a token passed to it from the Scanner, it allocates new memory and copies the string. The Parser cannot simply copy the pointer to the string, since the Scanner will eventually realloc or alter what the pointer points to when it gets the next token. The Parser keeps track of memory allocated and arranges that it all get freed before the program exits.

The semantic expression value stack routines do not copy or free memory pointed to by things pushed onto or popped off of the stack. The stack should not know what is inside the structure pushed on the stack. If the Parser needs to push a string-valued token from the scanner onto the value stack, the Parser must create its own dynamic memory copy of the string first, and push that. When the string-valued structure is popped by the Parser, the associated dynamic memory must be freed by the Parser.

Let's look at how the Parser must handle pushing and popping a token received from the Scanner:

Token t = scanner();
Item i;

i = t;
i.lexeme = my_strdup(i.lexeme);    -- get a copy of the string
push( &i );
...
i = pop();
mem_free(i.lexeme);                -- free the memory again
...
Since we must do this lexeme allocation and copy for anything pushed on the stack, the Parser code can be simplified if we write some utility functions that handle this memory management:

Utility functions for handling Items

Here are the prototypes and specifications for some useful utility functions used by the parser to handle the Item values on the stack. Code for these may be found in the assignment source directories.

static char *item_value(    Item *itemp /* IN: pointer to item to make printable */ );

Return a string printable version of whatever is in an item. String-valued items are simply returned as themselves; numbers get turned into strings and returned as strings. This simplifies printing out what is in an Item, since whether it is a number or a string this function makes it a string suitable for use in printf().

static void copy_push_item(    Token *tokenp /* IN: pointer to token to copy */ );

Evaluate a token by copying a Token structure to an Item and doing strdup() on any string(s), then push the resulting Item on the stack.

static void item_free(    Item *itemp /* IN: pointer to item whose memory is to be freed */ );

Free any dynamic memory in use by this Item. (Frees the memory obtained by copy_push_item().)

A Print Statement - `<pstate>`

Here is an augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:

<pstate> --> PRINT <xpress> {pop;print} ';'

The code that implements the semantic actions for this production is inserted into the recursive-descent parsing function exactly where it appears in the production. It looks something like this:
static Boolean pstate(void){ Item val -- structure containing type and value CALL match(PRINT), return FALSE if it fails CALL xpress(), return FALSE if it fails -- {pop;print} action goes here: val = pop() -- pop top of the value stack printf("%s", item_value(&val) ) -- print what we popped item_free( &val ) -- free memory of popped Item CALL match(SEMICOLON), return FALSE if it fails return TRUE -- parsing succeeded }

A Simple Expression - `<xpress>`

Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:

<xpress> --> CONST {eval;push} --> STR {eval;push} --> '-' <xpress> {pop;negate if numeric;push}

Unary Minus I

The semantic action for Unary Minus in an interpreted grammar is to pop the value of the operand off the semantic expression value stack, negate it, then push it back. Since the operand has to be evaluated first, before it can be negated, the semantic action for negation appears after the expression operand is parsed.

The operation for unary minus is not the same as two-operand subtraction; but, the operator used (the minus sign) is the same. Only the Parser knows, by the placement of the minus sign at the beginning of a certain production, that this use of the minus sign is the Unary Minus operation. The Scanner cannot make this distinction (since the Scanner knows nothing about the grammar), so the Scanner will always return the MINUS token type for both operations.

The code for the recursive-descent parsing function that parses and implements the semantic actions for the above productions looks something like this:
static Boolean xpress(void){ Item val; -- structure containing type and value SWITCH TYPEOFTOKEN CASE CONST: /*FALLTHROUGH*/ CASE STR: -- copy_push_item is a utility function described above copy_push_item( &token ) -- copy/push the token onto the stack get next lookahead token BREAK CASE MINUS: -- grammar says this must be a Unary Minus lineno = current line number get next lookahead token CALL xpress(), return FALSE if it fails val = pop() -- pop the value off the stack IF val.type is not an integer type THEN -- type mismatch: print a good error message here -- indicating that only integers can be negated item_free( &val ) -- free memory of popped Item RETURN FALSE ENDIF val.u.uint = - val.u.uint -- negate the value val.type = integer type -- (redundant) set type of result val.lineno = lineno -- pass on line number of expression push( &val ) -- push value back on the stack BREAK DEFAULT: -- print a good syntax error message here RETURN FALSE END SWITCH RETURN TRUE -- parsing succeeded }

The equivalence of the Item typedef and the Token typedef makes coding easy. The token structure returned by the Scanner can be directly assigned to an Item variable and pushed on the stack by the utility function.

Be careful to free the memory of anything popped off the stack that does not get pushed back on.

A Plus Expression - `<plus>`

Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:
<plus> --> <pterm> '+' <pterm> {pop;pop;add if numeric;push} static Boolean plus(void){ Item val1; -- structures containing type and value Item val2; Item val3; Boolean ret; -- return value of this function CALL pterm(), return FALSE if it fails CALL match(PLUS), return FALSE if it fails CALL pterm(), return FALSE if it fails val2 = pop() -- pop second operand from stack val1 = pop() -- pop first operand from stack IF val1.type and val2.type are both integers THEN val3.lexeme = NULL -- new item has no string value val3.type = integer type -- set the type of the result val3.u.uint = val1.u.uint + val2.u.uint -- add them push( &val3 ) -- push the sum ret = TRUE ELSE -- type mismatch: print a good error message here -- indicating that only integers can be added ret = FALSE ENDIF free memory used by popped val1 and val2 Item structures RETURN ret }

Note that we don't actually need to have three Item structures; two would be sufficient. We could simply perform the arithmetic and put the result back into the first structure again, and push it back onto the value stack. We'll show that in the next section.

A function to execute operations

We have seen that some productions have semantic action symbols that require operations to be performed on operands popped off the semantic expression value stack. We can localize most of these operations into a single execute() function whose task is to pop operands and perform operations. This takes all the stack manipulation code out of most of the parsing functions, and makes them much simpler.

The execute() function takes an argument that tells what type of operation has to be performed, and a line number giving the line on which the operation was found (for error messages). For example, given the following translation scheme requiring addition:
<add_expression> --> <xpress> '+' <xpress> {pop;pop;add if #;push}
The simplified code/pseudocode for the recursive-descent parsing function might look like this:
   static Boolean
add_expression(void){
   lineno = line number of current token
   CALL xpress(), return FALSE if it fails
   CALL match(PLUS), return FALSE if it fails
   CALL xpress(), return FALSE if it fails
   CALL execute(PLUS,lineno), return FALSE if it fails
   RETURN TRUE
}
All the code for semantic value stack manipulation has been moved into the execute() function, simplifying the parsing function. The line number passed to execute() is picked to be something that will be useful in error messages, in case execute() is unable to perform the requested operation. The line number of the token that begins the add_expression is not a bad choice. The line number of the PLUS token might also be good.

Unary Minus II

For the unary minus operator, the code given above for the simple expression xpress() simplifies to this when all the push/pop code is moved into the execute() function:
CASE MINUS:   -- grammar says this must be a Unary Minus
       lineno = current line number
       get next lookahead token
       CALL xpress(), return FALSE if it fails
       CALL execute(UNARY,lineno), return FALSE if it fails
       BREAK
The UNARY token type is not a token type returned by the Scanner. The Scanner cannot tell a Unary minus from a two-operand Subtraction minus. The Parser can tell them apart, and when the Parser recognizes a use of the single-operand unary minus, as above, it uses the UNARY token type as an argument to the execute() function instead of the MINUS token type. This tells execute() to do a negation using one operand instead of a subtraction using two operands.

The execute() function performs the indicated action on operands that have been pushed on the stack. The execute() function pops the number of operands off the stack that are appropriate for the given operation, tests for compatibility of operands with the given operator, does the type conversions (if any), and then does the actual operation and pushes the result back on the value stack. It returns TRUE if the operation worked, FALSE if it failed:
   static Boolean
execute(
   TokenType optype,     -- IN: type of operator
   long int lineno       -- IN: line number of expression
){
   Item op1;             -- First operand type and value
   Item op2;             -- Second operand type and value
   
   SWITCH optype
   CASE PLUS:            -- the '+' operation (two operands)
      op2 = pop()        -- op2 was pushed last 
      op1 = pop()        -- op1 was pushed first

      IF op1.type and op2.type are not both integers THEN
         -- type mismatch: issue a good error message here
         -- indicating that only integers can be added
         free memory in both op1 and op2 popped Item structures
         RETURN FALSE
      ENDIF

      op1.u.num += op2.u.num     -- do the PLUS operation
      op1.type = integer type    -- (redundant) set type of result
      op1.lineno = lineno        -- pass on line number of expression
      push( &op1 )               -- push the result back on stack
      free memory of op2 popped Item structure
      RETURN TRUE
   CASE UNARY:                   -- Unary Minus (one operand)
      op1 = pop()                -- pop the operand off the stack

      IF op1.type is not an integer type THEN
         -- type mismatch: print a good error message here
         -- indicating that only integers can be negated
         free memory of op1 popped Item structure
         RETURN FALSE
      ENDIF

      op1.u.uint = - op1.u.uint  -- negate the value
      op1.type = integer type    -- (redundant) set type of result
      op1.lineno = lineno        -- pass on line number of expression
      push( &op1 )               -- push value back on the stack
      RETURN TRUE
   CASE ... :                     -- other cases go here
      RETURN TRUE
   DEFAULT:
      -- should never get here unless parser is programmed incorrectly
      print an error message about the unknown operator
      RETURN FALSE (or exit the program)
   END SWITCH
}
Simple changes to execute() can handle each new operator. The operation doesn't have to be an arithmetic operation; it might be a print statement that pops and prints values, or an assignment statement that pops and puts values into the symbol table. All the semantic action code can be kept in the one execute() function.

Only operands that have been popped and not pushed back on the stack need to be freed by item_free().

Notes on stack handling

Free the memory used by any operands popped off the stack that are not pushed back. You don't have to free the memory if you push the operand back on the stack, since the memory will be freed when the operand is finally popped from the stack some time later.
Panic-mode error recovery must flush and free the memory of anything still left over on the value stack when an error is encountered, as part of resynchronizing the parser and getting it ready to parse a new statement. (The value stack should be empty at the start of parsing each new statement.) A loop that tests the stack for empty and pops each item and uses item_free() on it is all that is needed. For debugging purposes, print the type and value of everything popped off the stack during panic-mode recovery.
If you pop an operand into an Item structure variable, then want to re-use that Item variable to hold some other Item or Token, make sure you item_free() any strings in the structure before you overwrite it.
Don't put pop() calls into the argument list for a function call. C is free to evaluate function arguments in any order, so the pop() calls will not always execute in the order you expect.

Ian D. Allen CST 8152 Home Page

Semantic Action Coding

Overview

Using the Token structure as the stack data structure

Managing dynamic memory

Utility functions for handling Items

A Print Statement - <pstate>

A Simple Expression - <xpress>

Unary Minus I

A Plus Expression - <plus>

A function to execute operations

Unary Minus II

Notes on stack handling

A Print Statement - `<pstate>`

A Simple Expression - `<xpress>`

A Plus Expression - `<plus>`