CST 8152 - Assignment 3

Due 12 noon Monday June 16, 1997

Assigned: Monday, June 2 in class

Content last revised: 12 noon Monday, June 2

Due date changed on June 9.


In this assignment you write a DFA-based Lexical Analyser that recognizes some of the basic lexemes used in an Interpreter. You should be able to use much of your code from Assignment Two in this program.

Your Lexical Analyser will be written as a separate compilation unit, to hide its internal details. You will use a small set of public functions to set up the Analyser and interact with it. Only one global variable is allowed (see the instructions). See the course notes for help in designing a separate compilation unit using static functions and data.


Design, write, and thoroughly test a C Language module that functions as a Lexical Analyser (Scanner). Write a driver program that calls your Scanner repeatedly, returning and printing each token found by the Scanner in the input stream. Your Scanner has some public and some private functions to enable it to work. (Private functions and data are defined using the static keyword.) Some of these functions are described here:

Public function: lex_init()

This function sets up the Scanner. It receives and records in private storage the FILE* pointer of the file opened by the calling program. It does any other initialization needed to begin the scanning of the input stream, such as setting up dynamic memory buffers.

Public function: lex_gettoken()

This function reads the open file pointer and identifies a single token and the line number on which it was found. The token is a lexeme and its classification. The lexeme can be several different things, so it is stored as a C Language union. It is either empty, a character string in dynamic memory (char*), or an int or double numeric constant. The classification is an enum type denoting the type of the lexeme being returned. The enum has the following values, indicating what was found by the Scanner:

T_INT An unsigned integer constant, e.g. 0, 123, 32767
T_ID An identifier, as specified for previous assignments.
T_PLUS The plus sign: +
T_MINUS The minus sign: -
T_MUL The multiplication asterisk: *
T_DIV The division slash: /
T_EQU The assignment equals sign: =
T_SEMI The semicolon: ;
T_EOF A special token that indicates that the Scanner has reached end-of-file.
T_ERR Anything else. The union contains more details on the specific kind of error.

The Scanner places the lexeme, its classification type, and its line number in a global structure named lex_token. (The Scanner may also return a pointer to this structure, to facilitate coding, if desired.)

The Scanner leaves the input stream positioned ready to read the next lexme. A calling program will make repeated calls to lex_gettoken() to read successive tokens from an open input file.

Public function: lex_finished()

Call this function to close down the Scanner and release any memory resources it may have allocated. (Note that the open input file is not closed here, since it was not opened by the Scanner. Input files are the responsibility of the calling program [e.g. main()].)

Memory Management

The Scanner manages its own memory, including any memory allocated to or in the global lex_token structure. Functions outside the Scanner module must not modify the contents of the global token structure, and they must take care not to obtain pointers to storage that belongs to the Scanner module. (The Scanner module may eventually free or modify any storage it controls.)

This means that functions outside the Scanner module must copy any strings that they need to keep.


The Scanner module itself prints nothing except those error messages that cannot reasonably be returned as a T_ERR token to the calling function. Write a driver routine that opens an input file, initializes the Scanner, and repeatedly calls the Scanner to read all the tokens in the input file. Print the contents of the global token structure for each token found. When end-of-file is reached, tidy up and exit.

Example: The output should look something like this, only nicer:

Name: Ian Allen
Input file: Assignment Three Data-1.txt

Line  Type   Lexeme
  36  T_ID   area
  36  T_EQU  =
  36  T_ID   PI
  36  T_MUL  *
  36  T_ID   radius
  36  T_MUL  *
  36  T_ID   radius
  36  T_SEMI ;
  37  T_ID   print
  37  T_ID   area
  37  T_SEMI ;
  38  T_ERR  @$%&%$@
  39  T_EOF

Important Notes

The deliverables for your assignment are due in the Ian Allen assignment box before the due date.

Deliverables for this assignment

  1. The fully documented source listing of just your lexical analyser (including your own .h files).
    The submission must follow the online course submission guidelines.
  2. A description or listing of your input test file(s) (and output, if appropriate) showing how you tested your program. (Do not print out large test files. Describe them instead.) Remember to test empty files.
  3. I will show you how to deposit your assignment source code into a directory on the Algonquin server when the server is ready. All assignment source code will be collected for analysis later in the term.


The aggregate of all assignment marks comprises 20% of your final mark.
All assignments must be completed satisfactorily to get credit for the course.

Please review the C Programming Style guidelines. Assignment source code is marked for readability. You earn marks by making the source clear, simple, and concise. You lose marks if I have to puzzle over what a piece of code does.

Late assignments are handled according to the policy given in the course outline.

“Inside every big program is a little program struggling to get out.”

See also: Code Fragment for a DFA Scanner

See the course notes for programming help for the Scanner.

See also the code for a working sample DFA program.

Ian D. Allen CST8152 Home Page - Summer 1997