Output Format for Lexical Analyzer

Your lexical analyzer should output each token it identifies in the input MINI-L program. Each token should appear on a separate line of output, and the tokens should appear in the same order as they appear in the input program. To facilitate grading, the tokens must be output in the format described in the table below.

There are two types of lexical errors that your lexical analyzer should catch. They are described below.

Note: for this phase of the project, even syntactically incorrect MINI-L programs may still be tokenized successfully. Syntax errors will be caught in the next phase of the project.


List of Tokens

The following table describes the different kinds of tokens that may be outputted by your lexical analyzer. Comments and whitespace should be ignored by your lexical analyzer (you should not output any tokens for these).


Lexical Pattern in the Inputted MINI-L Program  Token that Should Be Outputted

Reserved Words
function                                        FUNCTION
beginparams                                     BEGIN_PARAMS
endparams                                       END_PARAMS
beginlocals                                     BEGIN_LOCALS
endlocals                                       END_LOCALS
beginbody                                       BEGIN_BODY
endbody                                         END_BODY
integer                                         INTEGER
array                                           ARRAY
of                                              OF
if                                              IF
then                                            THEN
endif                                           ENDIF
else                                            ELSE
while                                           WHILE
do                                              DO
beginloop                                       BEGINLOOP
endloop                                         ENDLOOP
continue                                        CONTINUE
read                                            READ
write                                           WRITE
and                                             AND
or                                              OR
not                                             NOT
true                                            TRUE
false                                           FALSE
return                                          RETURN

Arithmetic Operators
-                                               SUB
+                                               ADD
*                                               MULT
/                                               DIV
%                                               MOD

Comparison Operators
==                                              EQ
<>                                              NEQ
<                                               LT
>                                               GT
<=                                              LTE
>=                                              GTE

Identifiers and Numbers
identifier                                      IDENT XXXX [where XXXX is the identifier itself]
  (e.g., "aardvark", "BIG_PENGUIN", "fLaMInGo_17", "ot73r")
number                                          NUMBER XXXX [where XXXX is the number itself]
  (e.g., "17", "101", "90210", "0", "8675309")

Other Special Symbols
;                                               SEMICOLON
:                                               COLON
,                                               COMMA
(                                               L_PAREN
)                                               R_PAREN
[                                               L_SQUARE_BRACKET
]                                               R_SQUARE_BRACKET
:=                                              ASSIGN
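
As a rough illustration of how the table maps lexemes to output lines, here is a minimal Python sketch. It is not the required implementation. The names RESERVED, SYMBOLS, and tokenize are hypothetical, reserved words are assumed to be case-sensitive, comments are not handled because their syntax is not defined in this section, and neither of the two lexical errors is checked.

```python
import re

# Reserved words from the table, mapped to the token names to print.
RESERVED = {
    "function": "FUNCTION", "beginparams": "BEGIN_PARAMS",
    "endparams": "END_PARAMS", "beginlocals": "BEGIN_LOCALS",
    "endlocals": "END_LOCALS", "beginbody": "BEGIN_BODY",
    "endbody": "END_BODY", "integer": "INTEGER", "array": "ARRAY",
    "of": "OF", "if": "IF", "then": "THEN", "endif": "ENDIF",
    "else": "ELSE", "while": "WHILE", "do": "DO",
    "beginloop": "BEGINLOOP", "endloop": "ENDLOOP",
    "continue": "CONTINUE", "read": "READ", "write": "WRITE",
    "and": "AND", "or": "OR", "not": "NOT", "true": "TRUE",
    "false": "FALSE", "return": "RETURN",
}

# Two-character symbols come first so that "<=" is not read as "<".
SYMBOLS = [
    ("==", "EQ"), ("<>", "NEQ"), ("<=", "LTE"), (">=", "GTE"),
    (":=", "ASSIGN"),
    ("-", "SUB"), ("+", "ADD"), ("*", "MULT"), ("/", "DIV"), ("%", "MOD"),
    ("<", "LT"), (">", "GT"),
    (";", "SEMICOLON"), (":", "COLON"), (",", "COMMA"),
    ("(", "L_PAREN"), (")", "R_PAREN"),
    ("[", "L_SQUARE_BRACKET"), ("]", "R_SQUARE_BRACKET"),
]
SYMBOL_NAMES = dict(SYMBOLS)

TOKEN_RE = re.compile(
    r"(?P<word>[A-Za-z][A-Za-z0-9_]*)"
    r"|(?P<number>[0-9]+)"
    r"|(?P<symbol>" + "|".join(re.escape(s) for s, _ in SYMBOLS) + r")"
    r"|(?P<space>\s+)"
)


def tokenize(source):
    """Return one output line per token, in source order."""
    lines = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # Proper error reporting is out of scope for this sketch.
            raise ValueError(f"cannot tokenize at offset {pos}")
        pos = match.end()
        text = match.group()
        if match.lastgroup == "word":
            # A word is a reserved word if listed, otherwise an identifier.
            lines.append(RESERVED.get(text, f"IDENT {text}"))
        elif match.lastgroup == "number":
            lines.append(f"NUMBER {text}")
        elif match.lastgroup == "symbol":
            lines.append(SYMBOL_NAMES[text])
        # Whitespace produces no output.
    return lines


if __name__ == "__main__":
    print("\n".join(tokenize("n := n + 1;")))
```

Listing the two-character symbols ahead of their one-character prefixes is what keeps ":=" from being split into COLON followed by an unmatched "=".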


Lexical Errors to Catch

Your lexical analyzer should catch two different types of lexical errors. If either error is encountered while scanning a MINI-L program, your lexical analyzer should report the error message and then terminate immediately. The error message must include the line number, and the column position within that line, of the token associated with the error. The details are below.

Error Type 1: Unrecognized Symbol

Your lexical analyzer should report an error and terminate if an unrecognized symbol is encountered that is outside of a comment. For example, consider the following MINI-L function:

01. function test;
02. beginparams
03. endparams
04. beginlocals
05. n : integer;
06. endlocals
07. beginbody
08.    read n;
09.    n := n + 1?
10.    write n;
11. endbody

In the above program, the "?" symbol at line 9 (which is outside of a comment) is not defined in the MINI-L language. Thus, your lexical analyzer should output an "unrecognized symbol" error when it encounters the "?" (along with the line number and column position of the problematic symbol). For example:

Error at line 9, column 14: unrecognized symbol "?"
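
One way to produce the position information is to turn a character offset into a line and column pair. The Python sketch below is only an illustration under stated assumptions: the names VALID, line_and_column, and first_unrecognized_symbol are hypothetical, columns are assumed to be 0-based (consistent with the "column 0" messages in the invalid identifier examples), and comments are not skipped because their syntax is not defined in this section.

```python
import re

# Anything the token table can start with, plus whitespace.
# Identifier validity is a separate check and is not performed here.
VALID = re.compile(
    r"[A-Za-z0-9_]+|==|<>|<=|>=|:=|[-+*/%<>;:,()\[\]]|[ \t\r\n]+"
)


def line_and_column(source, offset):
    """Convert a character offset to (1-based line, 0-based column)."""
    line = source.count("\n", 0, offset) + 1
    last_newline = source.rfind("\n", 0, offset)  # -1 on the first line
    column = offset - (last_newline + 1)
    return line, column


def first_unrecognized_symbol(source):
    """Return the error message for the first bad character, or None."""
    pos = 0
    while pos < len(source):
        match = VALID.match(source, pos)
        if match is None:
            line, column = line_and_column(source, pos)
            return (f"Error at line {line}, column {column}: "
                    f'unrecognized symbol "{source[pos]}"')
        pos = match.end()
    return None


if __name__ == "__main__":
    # Assumption: the body lines use a four-space indent.
    program = (
        "function test;\n"
        "beginparams\n"
        "endparams\n"
        "beginlocals\n"
        "n : integer;\n"
        "endlocals\n"
        "beginbody\n"
        "    read n;\n"
        "    n := n + 1?\n"
        "    write n;\n"
        "endbody\n"
    )
    print(first_unrecognized_symbol(program))
```

Because the function returns at the first failure, nothing after the offending symbol is examined, which mirrors the requirement to terminate immediately after reporting the error.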


Error Type 2: Invalid Identifier

Your lexical analyzer should report an error and terminate if an invalid identifier is encountered. This can occur if the identifier starts with a digit or an underscore, or if the identifier ends with an underscore. For example, consider the following two MINI-L functions:

01. function test1;
02. beginparams
03. endparams
04. beginlocals
05. 2n : integer;
06. endlocals
07. beginbody
08. endbody

01. function test2;
02. beginparams
03. endparams
04. beginlocals
05. n_ : integer;
06. endlocals
07. beginbody
08. endbody

In both of the above functions, the identifier declared at line 5 is invalid. Thus, in both of these cases, your lexical analyzer should output an "invalid identifier" error when it encounters either the "2n" or the "n_". For example, in the first function above:
Error at line 5, column 0: identifier "2n" must begin with a letter
And in the second function above:
Error at line 5, column 0: identifier "n_" cannot end with an underscore
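
The two identifier rules can be checked on a maximal run of letters, digits, and underscores once it has been scanned. The Python sketch below is a hedged illustration, not the required implementation: check_identifier is a hypothetical helper, the caller is assumed to supply the line and column of the lexeme, and a run made only of digits is treated as a number rather than an identifier.

```python
import re


def check_identifier(lexeme, line, column):
    """Return an error message for an invalid identifier, or None if valid.

    `lexeme` is assumed to be a non-empty run of ASCII letters, digits,
    and underscores, as an illustrative scanner might produce.
    """
    if not re.fullmatch(r"[A-Za-z0-9_]+", lexeme):
        raise ValueError("lexeme must contain only letters, digits, underscores")
    if re.fullmatch(r"[0-9]+", lexeme):
        return None  # a plain number such as "17", not an identifier
    prefix = f"Error at line {line}, column {column}: "
    if not re.match(r"[A-Za-z]", lexeme):
        # Starts with a digit or an underscore.
        return prefix + f'identifier "{lexeme}" must begin with a letter'
    if lexeme.endswith("_"):
        return prefix + f'identifier "{lexeme}" cannot end with an underscore'
    return None


if __name__ == "__main__":
    print(check_identifier("2n", 5, 0))
    print(check_identifier("n_", 5, 0))
```

Checking the first character before the last means a lexeme such as "_x_" is reported with the "must begin with a letter" message; the section does not say which message takes priority, so that ordering is an assumption.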