The structure of compiler consists of two parts:
We’ll be covering the following topics in this tutorial:
Analysis part
• Analysis part breaks the source program into constituent pieces and imposes a grammatical structure on them which further uses this structure to create an intermediate representation of the source program.
• It is also termed as front end of compiler.
• Information about the source program is collected and stored in a data structure called symbol table.
Synthesis part
• Synthesis part takes the intermediate representation as input and transforms it to the target program.
• It is also termed as back end of compiler.
The design of compiler can be decomposed into several phases, each of which converts one form of source program into another.
The different phases of compiler are as follows:
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generation
5. Code optimization
6. Code generation
All of the aforementioned phases involve the following tasks:
• Symbol table management.
• Error handling.
Lexical Analysis
• Lexical analysis is the first phase of compiler which is also termed as scanning.
• Source program is scanned to read the stream of characters and those characters are grouped to form a sequence called lexemes which produces token as output.
• Token: Token is a sequence of characters that represent lexical unit, which matches with the pattern, such as keywords, operators, identifiers etc.
• Lexeme: Lexeme is instance of a token i.e., group of characters forming a token. ,
• Pattern: Pattern describes the rule that the lexemes of a token takes. It is the structure that must be matched by strings.
• Once a token is generated the corresponding entry is made in the symbol table.
Input: stream of characters
Output: Token
Token Template: <token-name, attribute-value>
(eg.) c=a+b*5;
Lexemes and tokens
Lexemes | Tokens |
c | identifier |
= | assignment symbol |
a | identifier |
+ | + (addition symbol) |
b | identifier |
* | * (multiplication symbol) |
5 | 5 (number) |
Hence, <id, 1><=>< id, 2>< +><id, 3 >< * >< 5>
Syntax Analysis
• Syntax analysis is the second phase of compiler which is also called as parsing.
• Parser converts the tokens produced by lexical analyzer into a tree like representation called parse tree.
• A parse tree describes the syntactic structure of the input.
• Syntax tree is a compressed representation of the parse tree in which the operators appear as interior nodes and the operands of the operator are the children of the node for that operator.
Input: Tokens
Output: Syntax tree
Semantic Analysis
• Semantic analysis is the third phase of compiler.
• It checks for the semantic consistency.
• Type information is gathered and stored in symbol table or in syntax tree.
• Performs type checking.
Intermediate Code Generation
• Intermediate code generation produces intermediate representations for the source program which are of the following forms:
o Postfix notation
o Three address code
o Syntax tree
Most commonly used form is the three address code.
t1 = inttofloat (5)
t2 = id3* tl
t3 = id2 + t2
id1 = t3
Properties of intermediate code
• It should be easy to produce.
• It should be easy to translate into target program.
Code Optimization
• Code optimization phase gets the intermediate code as input and produces optimized intermediate code as output.
• It results in faster running machine code.
• It can be done by reducing the number of lines of code for a program.
• This phase reduces the redundant code and attempts to improve the intermediate code so that faster-running machine code will result.
• During the code optimization, the result of the program is not affected.
• To improve the code generation, the optimization involves
o Deduction and removal of dead code (unreachable code).
o Calculation of constants in expressions and terms.
o Collapsing of repeated expression into temporary string.
o Loop unrolling.
o Moving code outside the loop.
o Removal of unwanted temporary variables.
t1 = id3* 5.0
id1 = id2 + t1
Code Generation
• Code generation is the final phase of a compiler.
• It gets input from code optimization phase and produces the target code or object code as result.
• Intermediate instructions are translated into a sequence of machine instructions that perform the same task.
• The code generation involves
o Allocation of register.php and memory.
o Generation of correct references.
o Generation of correct data types.
o Generation of missing code.
LDF R2, id3
MULF R2, # 5.0
LDF R1, id2
ADDF R1, R2
STF id1, R1
Symbol Table Management
• Symbol table is used to store all the information about identifiers used in the program.
• It is a data structure containing a record for each identifier, with fields for the attributes of the identifier.
• It allows finding the record for each identifier quickly and to store or retrieve data from that record.
• Whenever an identifier is detected in any of the phases, it is stored in the symbol table.
Example
int a, b; float c; char z;
Symbol name | Type | Address |
a | Int | 1000 |
b | Int | 1002 |
c | Float | 1004 |
z | char | 1008 |
Example
extern double test (double x); double sample (int count) { double sum= 0.0; for (int i = 1; i < = count; i++) sum+= test((double) i); return sum; }
Symbol name | Type | Scope |
test | function, double | extern |
x | double | function parameter |
sample | function, double | global |
count | int | function parameter |
sum | double | block local |
i | int | for-loop statement |
Error Handling
• Each phase can encounter errors. After detecting an error, a phase must handle the error so that compilation can proceed.
• In lexical analysis, errors occur in separation of tokens.
• In syntax analysis, errors occur during construction of syntax tree.
• In semantic analysis, errors may occur at the following cases:
(i) When the compiler detects constructs that have right syntactic structure but no meaning
(ii) During type conversion.
• In code optimization, errors occur when the result is affected by the optimization. In code generation, it shows error when code is missing etc.
Figure illustrates the translation of source code through each phase, considering the statement
c =a+ b * 5.
Error Encountered in Different Phases
Each phase can encounter errors. After detecting an error, a phase must some how deal with the error, so that compilation can proceed.
A program may have the following kinds of errors at various stages:
Lexical Errors
It includes incorrect or misspelled name of some identifier i.e., identifiers typed incorrectly.
Syntactical Errors
It includes missing semicolon or unbalanced parenthesis. Syntactic errors are handled by syntax analyzer (parser).
When an error is detected, it must be handled by parser to enable the parsing of the rest of the input. In general, errors may be expected at various stages of compilation but most of the errors are syntactic errors and hence the parser should be able to detect and report those errors in the program.
The goals of error handler in parser are:
• Report the presence of errors clearly and accurately.
• Recover from each error quickly enough to detect subsequent errors.
• Add minimal overhead to the processing of correcting programs.
There are four common error-recovery strategies that can be implemented in the parser to deal with errors in the code.
o Panic mode.
o Statement level.
o Error productions.
o Global correction.
Semantical Errors
These errors are a result of incompatible value assignment. The semantic errors that the semantic analyzer is expected to recognize are:
• Type mismatch.
• Undeclared variable.
• Reserved identifier misuse.
• Multiple declaration of variable in a scope.
• Accessing an out of scope variable.
• Actual and formal parameter mismatch.
Logical errors
These errors occur due to not reachable code-infinite loop.