Given a translation scheme that contains pseudocode semantic actions, we explore how to insert the code into the recursive descent predictive parsing functions that recognize the grammar.
The Parser of the interpreter we are building uses a semantic value stack. In the examples that follow, the stack is a stack of Item structures. An Item structure contains a type and some data, some of which may be part of a C Language union. The type field tells us what is currently stored in the Item. The Item structure is typedef'd to be identical to the Token structure returned from the Scanner.
Items pushed on the value stack may be of various types, including strings, integers, real numbers, etc., depending on the requirements of the grammar. Thus, the elements pushed on the stack are not simple numbers, they are structures containing both a type field and a C Language union of different types. The type field remembers what type of data was stored in the union, so that when we pop the item we know what type it is.
The requirements for a basic value stack element are quite similar to those for a Token. In both, we need to keep track of some data and the type of the data. For convenience, we make the stack Item structure identical to the Token structure returned from the Scanner:
typedef Token Item;
This definition lets us use all the TokenType constants to indicate the type of item pushed on the stack. Tokens can be assigned directly to Items, and vice-versa. We can even keep track of the line numbers of intermediate expressions that are pused on the stack, using the lineno field of the structure:
/* The Token returned by the Scanner. * A token is a classified lexeme. Here, you find the lexeme * and its classification, and the line number on which it was found. * The union holds the converted form of the lexeme for those tokens * that need to be converted (e.g. to integers). */ typedef struct Token { TokenType type; /* type of the lexeme recognized */ char *lexeme; /* actual lexeme character string */ long lineno; /* line number on which lexeme was found */ union { long uint; /* for holding unsigned integers */ } u; } Token;
Thus, token.lexeme or item.lexeme refers to the string element inside the structure; token.u.uint or item.u.uint refers to the integer in the structure.
To copy a Token to an Item, you can use simple structure assignment, since they are the same basic type:
Token t = scanner(); Item i; i = t; /* this copies a Token structure to an Item structure */
The semantic action examples in this file use this Item structure as the basic data type for push and pop.
The stack routines know nothing about the contents of the structure. The stack is only responsible for pushing and popping copies of the structures.
You must be careful about who "owns" dynamic memory among the Scanner, Parser, and Stack:
- The Scanner (lexical analyser) manages its own memory. The Scanner initialization function allocates any buffers needed; the Scanner reallocates space to handle any size input; the Scanner termination function frees the buffers. Via the Token structure lexeme tag, the Scanner returns a pointer to its own dynamic memory used to hold the lexeme. The Scanner will re-use this same memory, growing it as needed, to hold each new token. Nothing that calls the Scanner can rely on the lexeme pointer always pointing to the same lexeme; the Scanner will keep re-using, re-allocating, and overwriting the lexeme buffer. Anything that wants to save the lexeme string must copy it.
- The Parser manages its own memory. When it needs to keep a string in a token passed to it from the Scanner, it allocates new memory and copies the string. The Parser cannot simply copy the pointer to the string, since the Scanner will eventually realloc or alter what the pointer points to when it gets the next token. The Parser keeps track of memory allocated and arranges that it all get freed before the program exits.
- The semantic expression value stack routines do not copy or free memory pointed to by things pushed onto or popped off of the stack. The stack should not know what is inside the structure pushed on the stack. If the Parser needs to push a string-valued token from the scanner onto the value stack, the Parser must create its own dynamic memory copy of the string first, and push that. When the string-valued structure is popped by the Parser, the associated dynamic memory must be freed by the Parser.
Let's look at how the Parser must handle pushing and popping a token received from the Scanner:
Token t = scanner(); Item i; i = t; i.lexeme = my_strdup(i.lexeme); -- get a copy of the string push( &i ); ... i = pop(); mem_free(i.lexeme); -- free the memory again ...Since we must do this lexeme allocation and copy for anything pushed on the stack, the Parser code can be simplified if we write some utility functions that handle this memory management:
Here are the prototypes and specifications for some useful utility functions used by the parser to handle the Item values on the stack. Code for these may be found in the assignment source directories.
static char *item_value(
Item *itemp /* IN: pointer to item to make printable */
);
- Return a string printable version of whatever is in an item. String-valued items are simply returned as themselves; numbers get turned into strings and returned as strings. This simplifies printing out what is in an Item, since whether it is a number or a string this function makes it a string suitable for use in printf().
static void copy_push_item(
Token *tokenp /* IN: pointer to token to copy */
);
- Evaluate a token by copying a Token structure to an Item and doing strdup() on any string(s), then push the resulting Item on the stack.
static void item_free(
Item *itemp /* IN: pointer to item whose memory is to be freed */
);
- Free any dynamic memory in use by this Item. (Frees the memory obtained by copy_push_item().)
Here is an augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:
<pstate> --> PRINT <xpress> {pop;print} ';'
The code that implements the semantic actions for this production is inserted into the recursive-descent parsing function exactly where it appears in the production. It looks something like this:
static Boolean pstate(void){
Item val -- structure containing type and valueCALL match(PRINT), return FALSE if it fails
CALL xpress(), return FALSE if it fails
-- {pop;print} action goes here:
val = pop() -- pop top of the value stack
printf("%s", item_value(&val) ) -- print what we popped
item_free( &val ) -- free memory of popped ItemCALL match(SEMICOLON), return FALSE if it fails
return TRUE -- parsing succeeded
}
Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:
<xpress> --> CONST {eval;push}
--> STR {eval;push}
--> '-' <xpress> {pop;negate if numeric;push}Unary Minus I
The semantic action for Unary Minus in an interpreted grammar is to pop the value of the operand off the semantic expression value stack, negate it, then push it back. Since the operand has to be evaluated first, before it can be negated, the semantic action for negation appears after the expression operand is parsed.
The operation for unary minus is not the same as two-operand subtraction; but, the operator used (the minus sign) is the same. Only the Parser knows, by the placement of the minus sign at the beginning of a certain production, that this use of the minus sign is the Unary Minus operation. The Scanner cannot make this distinction (since the Scanner knows nothing about the grammar), so the Scanner will always return the MINUS token type for both operations.
The code for the recursive-descent parsing function that parses and implements the semantic actions for the above productions looks something like this:
static Boolean xpress(void){
Item val; -- structure containing type and valueSWITCH TYPEOFTOKEN
CASE CONST: /*FALLTHROUGH*/
CASE STR:
-- copy_push_item is a utility function described above
copy_push_item( &token ) -- copy/push the token onto the stack
get next lookahead token
BREAK
CASE MINUS: -- grammar says this must be a Unary Minus
lineno = current line number
get next lookahead token
CALL xpress(), return FALSE if it fails
val = pop() -- pop the value off the stack
IF val.type is not an integer type THEN
-- type mismatch: print a good error message here
-- indicating that only integers can be negated
item_free( &val ) -- free memory of popped Item
RETURN FALSE
ENDIF
val.u.uint = - val.u.uint -- negate the value
val.type = integer type -- (redundant) set type of result
val.lineno = lineno -- pass on line number of expression
push( &val ) -- push value back on the stack
BREAK
DEFAULT:
-- print a good syntax error message here
RETURN FALSE
END SWITCH
RETURN TRUE -- parsing succeeded
}
The equivalence of the Item typedef and the Token typedef makes coding easy. The token structure returned by the Scanner can be directly assigned to an Item variable and pushed on the stack by the utility function.
Be careful to free the memory of anything popped off the stack that does not get pushed back on.
Here is another augmented grammar production that is part of a translation scheme, with the semantic action symbols inserted:
<plus> --> <pterm> '+' <pterm> {pop;pop;add if numeric;push}
static Boolean plus(void){
Item val1; -- structures containing type and value
Item val2;
Item val3;
Boolean ret; -- return value of this functionCALL pterm(), return FALSE if it fails
CALL match(PLUS), return FALSE if it fails
CALL pterm(), return FALSE if it fails
val2 = pop() -- pop second operand from stack
val1 = pop() -- pop first operand from stackIF val1.type and val2.type are both integers THEN
val3.lexeme = NULL -- new item has no string value
val3.type = integer type -- set the type of the result
val3.u.uint = val1.u.uint + val2.u.uint -- add them
push( &val3 ) -- push the sum
ret = TRUE
ELSE
-- type mismatch: print a good error message here
-- indicating that only integers can be added
ret = FALSE
ENDIFfree memory used by popped val1 and val2 Item structures
RETURN ret
}
Note that we don't actually need to have three Item structures; two would be sufficient. We could simply perform the arithmetic and put the result back into the first structure again, and push it back onto the value stack. We'll show that in the next section.
We have seen that some productions have semantic action symbols that require operations to be performed on operands popped off the semantic expression value stack. We can localize most of these operations into a single execute() function whose task is to pop operands and perform operations. This takes all the stack manipulation code out of most of the parsing functions, and makes them much simpler.
The execute() function takes an argument that tells what type of operation has to be performed, and a line number giving the line on which the operation was found (for error messages). For example, given the following translation scheme requiring addition:
<add_expression> --> <xpress> '+' <xpress> {pop;pop;add if #;push}The simplified code/pseudocode for the recursive-descent parsing function might look like this:
static Boolean add_expression(void){ lineno = line number of current token CALL xpress(), return FALSE if it fails CALL match(PLUS), return FALSE if it fails CALL xpress(), return FALSE if it fails CALL execute(PLUS,lineno), return FALSE if it fails RETURN TRUE }All the code for semantic value stack manipulation has been moved into the execute() function, simplifying the parsing function. The line number passed to execute() is picked to be something that will be useful in error messages, in case execute() is unable to perform the requested operation. The line number of the token that begins the add_expression is not a bad choice. The line number of the PLUS token might also be good.
Unary Minus II
For the unary minus operator, the code given above for the simple expression xpress() simplifies to this when all the push/pop code is moved into the execute() function:
CASE MINUS: -- grammar says this must be a Unary Minus lineno = current line number get next lookahead token CALL xpress(), return FALSE if it fails CALL execute(UNARY,lineno), return FALSE if it fails BREAKThe UNARY token type is not a token type returned by the Scanner. The Scanner cannot tell a Unary minus from a two-operand Subtraction minus. The Parser can tell them apart, and when the Parser recognizes a use of the single-operand unary minus, as above, it uses the UNARY token type as an argument to the execute() function instead of the MINUS token type. This tells execute() to do a negation using one operand instead of a subtraction using two operands.
The execute() function performs the indicated action on operands that have been pushed on the stack. The execute() function pops the number of operands off the stack that are appropriate for the given operation, tests for compatibility of operands with the given operator, does the type conversions (if any), and then does the actual operation and pushes the result back on the value stack. It returns TRUE if the operation worked, FALSE if it failed:
static Boolean execute( TokenType optype, -- IN: type of operator long int lineno -- IN: line number of expression ){ Item op1; -- First operand type and value Item op2; -- Second operand type and value SWITCH optype CASE PLUS: -- the '+' operation (two operands) op2 = pop() -- op2 was pushed last op1 = pop() -- op1 was pushed first IF op1.type and op2.type are not both integers THEN -- type mismatch: issue a good error message here -- indicating that only integers can be added free memory in both op1 and op2 popped Item structures RETURN FALSE ENDIF op1.u.num += op2.u.num -- do the PLUS operation op1.type = integer type -- (redundant) set type of result op1.lineno = lineno -- pass on line number of expression push( &op1 ) -- push the result back on stack free memory of op2 popped Item structure RETURN TRUE CASE UNARY: -- Unary Minus (one operand) op1 = pop() -- pop the operand off the stack IF op1.type is not an integer type THEN -- type mismatch: print a good error message here -- indicating that only integers can be negated free memory of op1 popped Item structure RETURN FALSE ENDIF op1.u.uint = - op1.u.uint -- negate the value op1.type = integer type -- (redundant) set type of result op1.lineno = lineno -- pass on line number of expression push( &op1 ) -- push value back on the stack RETURN TRUE CASE ... : -- other cases go here RETURN TRUE DEFAULT: -- should never get here unless parser is programmed incorrectly print an error message about the unknown operator RETURN FALSE (or exit the program) END SWITCH }Simple changes to execute() can handle each new operator. The operation doesn't have to be an arithmetic operation; it might be a print statement that pops and prints values, or an assignment statement that pops and puts values into the symbol table. All the semantic action code can be kept in the one execute() function.
Only operands that have been popped and not pushed back on the stack need to be freed by item_free().