
Compiler Design

Syntax Directed Definition (SDD) and Types of Syntax Directed Definitions

By Dinesh Thakur

A syntax directed definition (SDD) specifies the values of attributes by associating semantic rules with grammar productions.

It is a context-free grammar together with attributes and rules: attributes are associated with grammar symbols and semantic rules with productions.

The process of syntax directed translation is two-fold:

• Construction of syntax tree and

• Computing the values of attributes at each node by visiting the nodes of the syntax tree.

Semantic actions

Semantic actions are fragments of code embedded within production bodies by syntax directed translation.

They are usually enclosed within curly braces ({ }).

A semantic action can occur anywhere in a production body, but is usually placed at the end.

(eg.)

                                            E → E1 + T { print '+' }

Types of translation

 

• L-attributed translation

o Performs translation during parsing itself.

o No need for explicit tree construction.

o L stands for 'left to right'.

 

• S-attributed translation

o It is performed in connection with bottom-up parsing.

o S stands for 'synthesized'.

Types of attributes

• Inherited attributes

   o Defined by a semantic rule associated with the production at the parent of the node.

   o Attribute values are computed from the node's parent, its siblings and the node itself.

   o The non-terminal concerned must be in the body of the production.

• Synthesized attributes

   o Defined by a semantic rule associated with the production at the node itself.

   o Attribute values are computed from the node's children and the node itself.

   o The non-terminal concerned must be in the head of the production.

   o Terminals have synthesized attributes only: the lexical values (denoted lexval) supplied by the lexical analyzer.

                      Syntax directed definition of a simple desk calculator

Production          Semantic rules
L → E n             L.val = E.val
E → E1 + T          E.val = E1.val + T.val
E → T               E.val = T.val
T → T1 * F          T.val = T1.val × F.val
T → F               T.val = F.val
F → ( E )           F.val = E.val
F → digit           F.val = digit.lexval

                      Syntax directed definition with inherited attributes

Production          Semantic rules
D → T L             L.inh = T.type
T → int             T.type = integer
T → float           T.type = float
L → L1 , id         L1.inh = L.inh; addType(id.entry, L.inh)
L → id              addType(id.entry, L.inh)

• Symbol T is associated with a synthesized attribute type.

• Symbol L is associated with an inherited attribute inh.

Types of Syntax Directed Definitions

S-attributed Definitions

A syntax directed definition that involves only synthesized attributes is called S-attributed. The attribute value for the non-terminal at the head is computed from the attribute values of the symbols in the body of the production.

The attributes of an S-attributed SDD can be evaluated in bottom-up order on the nodes of the parse tree, i.e., by performing a post-order traversal of the parse tree and evaluating the attributes at a node when the traversal leaves that node for the last time.

 

Production          Semantic rules
L → E n             L.val = E.val
E → E1 + T          E.val = E1.val + T.val
E → T               E.val = T.val
T → T1 * F          T.val = T1.val × F.val
T → F               T.val = F.val
F → ( E )           F.val = E.val
F → digit           F.val = digit.lexval
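Because yacc evaluates synthesized attributes exactly this way during bottom-up parsing, the SDD above maps directly onto a yacc specification. Below is a minimal sketch, assuming single-digit operands and a trivial hand-written yylex(); rule and token names are illustrative, not the only possible encoding.

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%token DIGIT
%%
line : expr '\n'         { printf("%d\n", $1); }  /* L.val = E.val, n = newline */
     ;
expr : expr '+' term     { $$ = $1 + $3; }        /* E.val = E1.val + T.val */
     | term                                       /* default action: $$ = $1 */
     ;
term : term '*' factor   { $$ = $1 * $3; }        /* T.val = T1.val * F.val */
     | factor
     ;
factor : '(' expr ')'    { $$ = $2; }             /* F.val = E.val */
       | DIGIT                                    /* F.val = digit.lexval */
       ;
%%
int yylex(void) {                   /* returns DIGIT with yylval = lexical value */
    int c = getchar();
    if (c >= '0' && c <= '9') { yylval = c - '0'; return DIGIT; }
    return c;
}
int main(void) { return yyparse(); }

Each $$ assignment computes the head's val attribute from the attributes of the body symbols ($1, $3), mirroring the semantic rules in the table above.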

L-attributed Definitions

A syntax directed definition in which the edges of the dependency graph for the attributes in a production body can go from left to right, but not from right to left, is called an L-attributed definition. Attributes of L-attributed definitions may be either synthesized or inherited.

An inherited attribute must be computed from:

• an inherited attribute associated with the production head;

• an inherited or synthesized attribute associated with a symbol located to the left of the attribute being computed;

• an inherited or synthesized attribute associated with the symbol carrying the attribute itself, provided no cycles are formed in the dependency graph.

Production          Semantic rules
T → F T'            T'.inh = F.val
T' → * F T1'        T1'.inh = T'.inh × F.val

In the first production, the inherited attribute T'.inh is computed from F.val, where F is to its left. In the second production, T1'.inh is computed from T'.inh, associated with the head, and from F.val, where F appears to its left in the body. That is, an inherited attribute must be computed only from information above it or to its left in the SDD.

Error Handling and Error Recovery In Syntax Analyzer

By Dinesh Thakur

• An efficient parser should not terminate on a parse error.

• It must recover to parse the rest of the input and check for subsequent errors.

• For single-line input, the routine yyparse() can be made to return 1 on error, after which the caller invokes yyparse() again.

YACC program error handling

• The parser must be capable of detecting an error as soon as it is encountered, i.e., when the input stream does not match the rules of the grammar.

• If there is an error-handling subroutine in the grammar file, the parser can allow for entering the data again, ignoring the bad data or initiating a cleanup and recovery action.

• When the parser finds an error, it may need to reclaim parse tree storage, delete or alter symbol table entries and set switches to avoid generating further output.

• Error handling routines are used to restart the parser to continue its process even after the occurrence of error.

• Tokens following the error get discarded to restart the parser.

• The YACC command uses a special token named error for error handling. This token is placed at points in the grammar where errors might occur, so that a recovery subroutine can be provided.

• To prevent subsequent occurrence of errors, the parser remains in error state until it processes three tokens following an error.

• The input is discarded and no message is produced if an error occurs while the parser remains in the error state.

(eg.) stat : error ';'

o The above rule tells the parser that when there is an error, it should ignore the token and all following tokens until it finds the next semicolon.

o It discards all the tokens after the error and before the next semicolon.

o Once a semicolon is found, the rule is reduced by the parser, and the cleanup action associated with that rule is performed.

Providing for error correction

Input errors can be corrected by having the offending line entered into the data stream again.

input : error '\n'
        {
            printf("Reenter last line: ");
        }
        input
        {
            $$ = $4;
        } ;

The YACC statement, yyerrok is used to indicate that error recovery is complete.

This statement leaves the error state and begins processing normally.

input : error '\n'
        {
            yyerrok;
            printf("Reenter last line: ");
        }
        input
        {
            $$ = $4;
        } ;

Clearing the Lookahead token

 

• When an error occurs, the lookahead token becomes the token at which the error was detected.

• The lookahead token must be changed if the error recovery action includes code to find the correct place to start processing again.

• To clear the lookahead token, the error-recovery action issues the following statement:

yyclearin;

To assist in error handling, macros can be placed in YACC actions.

                                            Macros for error handling

YYERROR          Causes the parser to initiate error handling.
YYABORT          Causes the parser to return with a value of 1.
YYACCEPT         Causes the parser to return with a value of 0.
YYRECOVERING()   Returns a value of 1 if a syntax error has been detected
                 and the parser has not yet fully recovered.
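The pieces above can be combined into a single recovery rule. The following is a minimal, self-contained sketch; the single-character statement 'a' is only a stand-in for a real statement rule.

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
%}
%%
lines : /* empty */
      | lines line
      ;
line  : 'a' '\n'        { printf("ok\n"); }
      | error '\n'      { yyerrok;          /* leave the error state        */
                          yyclearin;        /* discard the lookahead token  */
                          printf("Reenter last line:\n"); }
      ;
%%
int yylex(void) { int c = getchar(); return c == EOF ? 0 : c; }
int main(void)  { return yyparse(); }

On a bad line, the parser discards tokens until the newline, prints the prompt, resets the error state with yyerrok, and drops the offending lookahead with yyclearin.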

LR Parsers – Compiler Design

By Dinesh Thakur

LR parsers are used to parse a large class of context-free grammars. This technique is called LR(k) parsing.

• L stands for left-to-right scanning of the input.

• R stands for constructing a rightmost derivation in reverse.

• k is the number of input symbols of lookahead used in making parsing decisions.

There are three widely used algorithms available for constructing an LR parser:

• SLR(1) – Simple LR

    o Works on the smallest class of grammars.

    o Few states, hence a very small table.

    o Simple and fast construction.

• LR(1) – LR parser

    o Also called the canonical LR parser.

    o Works on the complete set of LR(1) grammars.

    o Generates a large table and a large number of states.

    o Slow construction.

• LALR(1) – Lookahead LR parser

    o Works on an intermediate size of grammar.

    o The number of states is the same as in SLR(1).

Reasons for attractiveness of LR parser

• LR parsers can handle a large class of context-free grammars.

• The LR parsing method is the most general non-backtracking shift-reduce parsing method.

• An LR parser can detect syntax errors as soon as they occur.

• LR grammars can describe more languages than LL grammars.

Drawbacks of LR parsers

• It is too much work to construct an LR parser by hand; an automated parser generator is needed.

• If the grammar contains ambiguities or other constructs that are difficult to parse in a left-to-right scan of the input, the method fails.

Model of LR Parser

An LR parser consists of an input, an output, a stack, a driver program and a parsing table that has two functions:

1. Action

2. Goto

The driver program is the same for all LR parsers; only the parsing table changes from one parser to another.

The parsing program reads characters from an input buffer one at a time. Where a shift-reduce parser would shift a symbol, an LR parser shifts a state. Each state summarizes the information contained in the stack below it.

The stack holds a sequence of states s0, s1, ···, sm, where sm is on top.

             Model of an LR Parser

Action This function takes as arguments a state i and a terminal a (or $, the input end marker). The value of ACTION[i, a] can have one of four forms:

i) Shift j, where j is a state.

ii) Reduce by a grammar production A—> β.

iii) Accept.

iv) Error.

Goto This function takes a state and grammar symbol as arguments and produces a state.

If GOTO[Ii, A] = Ij, then GOTO also maps state i and non-terminal A to state j.

Behavior of the LR parser

1. If ACTION[sm, ai] = shift s, the parser executes a shift move: it shifts the next state s onto the stack, where

a) sm is the state on top of the stack, and

b) ai is the current input symbol.

2. If ACTION[sm, ai] = reduce A → β, then the parser executes a reduce move, entering the configuration

                                   (s0 s1 ... sm−r s, ai+1 ... an $)

a) where r is the length of β and s = GOTO[sm−r, A].

b) The parser first pops r state symbols off the stack, exposing state sm−r.

c) It then pushes s, the entry for GOTO[sm−r, A], onto the stack.

3. If ACTION[sm, ai] = accept, parsing is completed.

4. If ACTION[sm, ai] = error, the parser has discovered an error and calls an error recovery routine.

LR Parsing Algorithm

Algorithm LR parsing algorithm.

Input   An input string w and an LR-parsing table with
       functions ACTION and GOTO for a grammar G.

Output If w is in L(G), the reduction steps of a
       bottom-up parse for w;
       otherwise, an error indication.

Method Initially, the parser has s0 on its stack,
       where s0 is the initial state, and w$ in the
       input buffer.

       let a be the first symbol of w$;
       while (1) { /* repeat forever */
           let s be the state on top of the stack;
           if (ACTION[s, a] = shift t) {
               push t onto the stack;
               let a be the next input symbol;
           } else if (ACTION[s, a] = reduce A → β) {
               pop |β| symbols off the stack;
               let state t now be on top of the stack;
               push GOTO[t, A] onto the stack;
               output the production A → β;
           } else if (ACTION[s, a] = accept) break;
               /* parsing is done */
           else call error-recovery routine;
       }
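As a concrete illustration, the following C sketch hardcodes the classic SLR(1) ACTION/GOTO table for the expression grammar (1) E → E + T, (2) E → T, (3) T → T * F, (4) T → F, (5) F → (E), (6) F → id, and runs the driver loop above on id + id * id. The encoding (positive = shift, negative = reduce) is an implementation choice for this sketch, not part of the method.

#include <stdio.h>

enum { ID, PLUS, STAR, LP, RP, END };      /* terminal columns      */
enum { E, T, F };                          /* non-terminal columns  */
#define S(n) (n)                           /* shift to state n      */
#define R(n) (-(n))                        /* reduce by production n */
#define ACC  100                           /* accept                */

static const int ACTION[12][6] = {         /* 0 means error */
/*          id     +      *      (      )      $    */
/* 0 */ { S(5),  0,     0,     S(4),  0,     0    },
/* 1 */ { 0,     S(6),  0,     0,     0,     ACC  },
/* 2 */ { 0,     R(2),  S(7),  0,     R(2),  R(2) },
/* 3 */ { 0,     R(4),  R(4),  0,     R(4),  R(4) },
/* 4 */ { S(5),  0,     0,     S(4),  0,     0    },
/* 5 */ { 0,     R(6),  R(6),  0,     R(6),  R(6) },
/* 6 */ { S(5),  0,     0,     S(4),  0,     0    },
/* 7 */ { S(5),  0,     0,     S(4),  0,     0    },
/* 8 */ { 0,     S(6),  0,     0,     S(11), 0    },
/* 9 */ { 0,     R(1),  S(7),  0,     R(1),  R(1) },
/*10 */ { 0,     R(3),  R(3),  0,     R(3),  R(3) },
/*11 */ { 0,     R(5),  R(5),  0,     R(5),  R(5) },
};
static const int GO[12][3] = {             /* GOTO[state][non-terminal] */
    {1,2,3}, {0,0,0}, {0,0,0}, {0,0,0}, {8,2,3}, {0,0,0},
    {0,9,3}, {0,0,10}, {0,0,0}, {0,0,0}, {0,0,0}, {0,0,0},
};
static const int LHS[] = {0, E, E, T, T, F, F};   /* head of each production */
static const int LEN[] = {0, 3, 1, 3, 1, 3, 1};   /* length of its body      */

int main(void) {
    int w[] = { ID, PLUS, ID, STAR, ID, END };    /* id + id * id $ */
    int stack[100], top = 0, ip = 0;
    stack[0] = 0;                                 /* initial state s0 */
    for (;;) {
        int act = ACTION[stack[top]][w[ip]];
        if (act == ACC)   { puts("accept"); return 0; }
        else if (act > 0) { stack[++top] = act; ip++; }   /* shift  */
        else if (act < 0) {                               /* reduce */
            int p = -act;
            top -= LEN[p];                    /* pop |body| states  */
            stack[top + 1] = GO[stack[top]][LHS[p]];
            top++;
            printf("reduce by production %d\n", p);
        }
        else { puts("error"); return 1; }
    }
}

Running it prints the reduction sequence 6, 4, 2, 6, 4, 6, 3, 1 and then accepts, which is the rightmost derivation of id + id * id in reverse.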

LR(0) Items

An LR(0) item of a grammar G is a production of G with a dot at some position of the body.

(eg.) The production A → XYZ yields the four items

                                                   A → •XYZ

                                                   A → X•YZ

                                                   A → XY•Z

                                                   A → XYZ•

A collection of sets of LR(0) items, called the canonical LR(0) collection, provides a finite automaton that is used to make parsing decisions. Such an automaton is called an LR(0) automaton.

LR(0) Parser / SLR(1) Parser

An LR(0) parser is a shift-reduce parser that uses zero tokens of lookahead to determine what action to take (hence the 0). This means that in any configuration of the parser, the parser must have an unambiguous action to choose: either it shifts a specific symbol or it applies a specific reduction. If there are ever two or more choices to make, the parser fails and the grammar is not LR(0).

An LR parser makes shift-reduce decisions by maintaining states to keep track of parsing. States represent a set of items.

Closure of item sets

If I is a set of items for a grammar G, then CLOSURE(I) is the set of items constructed from I by the following two rules:

• Initially, add every item in I to CLOSURE(I).

• If A → α•Bβ is in CLOSURE(I) and B → ɣ is a production, then add the item B → •ɣ to CLOSURE(I), if it is not already there. Apply this rule until no more items can be added to CLOSURE(I).
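(eg.) For the augmented expression grammar E' → E, E → E + T | T, T → T * F | F, F → (E) | id,

CLOSURE({E' → •E}) = { E' → •E, E → •E + T, E → •T, T → •T * F, T → •F, F → •(E), F → •id }

since the dot before E pulls in all E-productions, the dot before T all T-productions, and so on, until nothing more can be added.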

Construct canonical LR(O) collection

• The augmented grammar is defined together with the two functions CLOSURE and GOTO. If G is a grammar with start symbol S, then the augmented grammar G' is G with a new start production S' → S.

• The role of the augmented production is to stop parsing and signal acceptance of the input; i.e., acceptance occurs when and only when the parser performs a reduction by S' → S.

Limitations of the LR(O) parsing method

Consider the grammar for matched parentheses:

1. S' → S

2. S → ( S ) S

3. S → ε

The LR(0) DFA of grammar G is shown below.

In states 0, 2 and 4 the parser can both shift ( and reduce ε to S.

       

Conflicts

 

Conflicts are situations that arise when there is more than one option to choose for a particular step: shift or reduce.

• Two kinds of conflicts may arise:

     shift-reduce and reduce-reduce.

• In state 0 the parser encounters a conflict:

     It can shift state 2 onto the stack when the next token is (.

     It can also reduce by production 3: S → ε.

     This is called a shift-reduce conflict.

     This conflict also appears in states 2 and 4.

       

Shift-reduce conflict: the parser can both shift and reduce.

Reduce-reduce conflict: two (or more) productions can be reduced.

SLR(1) grammars

• SLR(1) parsing increases the power of LR(0) significantly:

     the lookahead token is used to make parsing decisions;

     the reduce action is applied more selectively, according to the FOLLOW set.

• A grammar is SLR(1) if two conditions are met in every state:

     if A → α•xɣ and B → β• are in the state, then token x ∉ FOLLOW(B);

     if A → α• and B → β• are in the state, then FOLLOW(A) ∩ FOLLOW(B) = ∅.

• Violation of the first condition results in a shift-reduce conflict:

     if A → α•xɣ and B → β• and x ∈ FOLLOW(B), then

     the parser can both shift x and reduce B → β.

• Violation of the second condition results in a reduce-reduce conflict:

     if A → α• and B → β• and x ∈ FOLLOW(A) ∩ FOLLOW(B), then

     the parser can reduce by both A → α and B → β.

• SLR(1) grammars are a superset of LR(0) grammars.

LR(1) Parser / Canonical LR (CLR)

• Even more powerful than SLR(1) is the LR(1) parsing method.

• LR(1) includes LR(0) items and a lookahead token in the item sets.

• An LR(1) item consists of:

o the grammar production rule,

o the right-hand position, represented by the dot, and

o a lookahead token:

o A → X1 ··· Xi • Xi+1 ··· Xn, l    where l is a lookahead token.

• The • represents how much of the right-hand side has been seen:

o X1 ··· Xi appear on top of the stack;

o Xi+1 ··· Xn are expected to appear in the input buffer.

• The lookahead token l is expected after X1 ··· Xn appears on the stack.

• An LR(1) state is a set of LR(1) items.

Introduction to LALR Parser

• LALR stands for lookahead LR parser.

• It extends LR(0) items by introducing one symbol of lookahead on the input.

• It supports a large class of grammars.

• The number of states in an LALR parser is smaller than that of an LR(1) parser. Hence, LALR is preferable, as it can be used with reduced memory.

• Most syntactic constructs of programming languages can be stated conveniently in LALR grammars.

Steps to construct LALR parsing table

• Generate the LR(1) items.

• Find the items that have the same set of first components (core) and merge these sets into one.

• Merge the GOTOs of the combined item sets.

• Revise the parsing table of the LR(1) parser by replacing states and GOTOs with the combined states and combined GOTOs respectively.

Type of Parsing

By Dinesh Thakur

Top-Down Parsing

 

Top-down parsing constructs the parse tree for the input string starting from the root node, creating the nodes of the parse tree in pre-order.

It corresponds to a leftmost derivation of the input string.

 

       

                              Top-down parsing for the string id + id * id

General Strategies

• Top-down parsing involves constructing the parse tree starting from root node to leaf node by consuming tokens generated by lexical analyzer.

• Top-down parsing is characterized by the following methods:

• Brute-force method, accompanied by a parsing algorithm:

   o all possible combinations are attempted before failure to parse is recognized.

• Recursive descent, a parsing technique in which a procedure is associated with each non-terminal:

   o it may involve backtracking, and left recursion must be avoided.

• Top-down parsing with limited or partial backup.

Recursive Descent Parser

• Recursive descent parser is a top-down parser.

• It requires backtracking to find the correct production to be applied.

• The parsing program consists of a set of procedures, one for each non-terminal.

• Process begins with the procedure for start symbol.

• Start symbol is placed at the root node and on encountering each non-terminal, the procedure concerned is called to expand the non-terminal with its corresponding production.

• Procedure is called recursively until all non-terminals are expanded.

• Successful completion occurs when the scan over the entire input string is done, i.e., all terminals in the sentence are derived by the parse tree.

void A()
{
     choose an A-production, A → X1 X2 ... Xk;
     for (i = 1 to k) {
         if (Xi is a non-terminal)
             call procedure Xi();
         else if (Xi equals the current input symbol a)
             advance the input to the next symbol;
         else
             error;
     }
}

Limitation

 

• When a grammar with a left-recursive production is given, the parser can get into an infinite loop.

Hence, left recursion must be eliminated.

(eg.) Let grammar G be,

S → SAd

A → ab | d

              

Recursive descent parser with backtracking

 

(eg.) Let grammar G be,

S → cAd

A → ab | a

w = cad

                 

Explanation

 

• The root node contains the start symbol which is S.

• The body of production begins with c, which matches with the first symbol of the input string.

• A is a non-terminal having two productions, A → ab | a.

• Applying the first production of A results in the string cabd, which does not match the given string cad.

• Backtrack to the previous step where the production of A gets expanded and try with alternate production of it.

• This produces the string cad that matches with the given string.

Limitation

• If the given grammar has a larger number of alternatives, then the cost of backtracking will be high.
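The parse just described can be written down directly in C. The following is a minimal sketch for the grammar S → cAd, A → ab | a; the global input/pos scheme is an illustrative choice, not the only way to implement backtracking.

#include <stdio.h>

static const char *input = "cad";
static int pos = 0;

static int A(void) {                 /* A -> ab | a, with backtracking */
    int save = pos;
    if (input[pos] == 'a' && input[pos + 1] == 'b') { pos += 2; return 1; }
    pos = save;                      /* first alternative failed: backtrack */
    if (input[pos] == 'a') { pos += 1; return 1; }
    pos = save;
    return 0;
}

static int S(void) {                 /* S -> cAd */
    if (input[pos] == 'c') {
        pos++;
        if (A() && input[pos] == 'd') { pos++; return 1; }
    }
    return 0;
}

int main(void) {
    if (S() && input[pos] == '\0') puts("string accepted");
    else puts("syntax error");
    return 0;
}

For cad, A() first tries ab, fails on the d, restores pos, and then succeeds with a — exactly the backtracking step in the explanation above.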

 

Recursive descent parser without backtracking

 

A recursive descent parser without backtracking works in a similar way to one with backtracking, with the difference that each non-terminal must be expanded by its correct alternative at the first selection itself.

When the correct alternative is not chosen, the parser cannot backtrack, and a syntactic error results.

 

Advantage

• Overhead associated with backtracking is eliminated.

Limitation

• When more than one alternative with a common prefix occurs, the selection of the correct alternative is highly difficult.

Hence, this process requires a grammar with no common prefixes among alternatives.

Predictive Parser / LL(1) Parser

• Predictive parsers are top-down parsers.

• It is a type of recursive descent parser, but with no backtracking.

• It can be implemented non-recursively by using a stack data structure.

• It is also termed an LL(1) parser, as it is constructed for the class of grammars called LL(1).

• The production to be applied for a non-terminal is decided based on the current input symbol.

             

A grammar G is LL(1) if, for any two distinct productions A → α | β, the following conditions hold:

o For no terminal a do both α and β derive strings beginning with a.

o At most one of α and β can derive the empty string.

o If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).

o If α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).

In order to overcome the limitations of the recursive descent parser, the LL(1) parser is designed by using an explicit stack data structure to hold grammar symbols.

In addition to this,

• Left recursion is eliminated.

• Common prefixes are also eliminated (Left factoring).

 

Eliminating left recursion

A grammar is left recursive if it has a production of the form A → Aα, for some string α.

To eliminate left recursion for the production A → Aα | β:

 

Rule

A → β A'

A' → α A' | ε

Example

A → A α1 | A α2 | ··· | A αn | β1 | β2 | ··· | βm

 

Solution:

A → β1 A' | β2 A' | ··· | βm A'

A' → α1 A' | α2 A' | ··· | αn A' | ε
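(eg.) Applying the rule to the left-recursive expression productions E → E + T | T gives

E → T E'

E' → + T E' | ε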

 

Left factoring

 

When a production has more than one alternative with a common prefix, the parser must defer the choice of production until enough of the input has been seen to make the right choice.

This is done by rewriting the productions (left factoring).

To perform left factoring for the production A → αβ1 | αβ2:

Rule

 

A → α A'

A' → β1 | β2

 

Example

A → α β1 | α β2 | ··· | α βm | ɣ

Solution

A → α A' | ɣ

A' → β1 | β2 | ··· | βm
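(eg.) For the dangling-else productions stmt → if expr then stmt else stmt | if expr then stmt, left factoring gives

stmt → if expr then stmt S'

S' → else stmt | ε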

 

Computation of FIRST

 

FIRST(α) is the set of terminals that begin strings derived from α.

 

Rules

• To compute FIRST(X), where X is a grammar symbol:

• If X is a terminal, then FIRST(X) = {X}.

• If X → ε is a production, then add ε to FIRST(X).

• If X is a non-terminal and X → Y1 Y2 ··· Yk is a production, then add FIRST(Y1) to FIRST(X). If Y1 derives ε, then also add FIRST(Y2) to FIRST(X), and so on.

Computation of FOLLOW

FOLLOW(A) is the set of terminals a that can appear immediately to the right of A in some sentential form.

If A can be the rightmost symbol in a sentential form, then $ is in FOLLOW(A).

 

Rules

• Place $ in FOLLOW(start symbol), where $ is the input end marker.

• If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).

• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
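(eg.) For the grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id:

FIRST(E) = FIRST(T) = FIRST(F) = { (, id },  FIRST(E') = { +, ε },  FIRST(T') = { *, ε }

FOLLOW(E) = FOLLOW(E') = { ), $ },  FOLLOW(T) = FOLLOW(T') = { +, ), $ },  FOLLOW(F) = { +, *, ), $ }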

 

Construction of parsing table

 

Algorithm Construction of predictive parsing table

Input   Grammar G

Output Parsing table M

Method For each production A → α, do the following:

1. For each terminal a in FIRST(α), add A → α to M[A, a].

2. If ε is in FIRST(α), then for each terminal b in FOLLOW(A), add A → α to M[A, b].

3. If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].

4. If no production is placed in M[A, a], then set M[A, a] to error.

Note:

In general, parsing table entry will be empty for indicating error status.

 

Parsing of input

 

Predictive parser contains the following components:

• Stack – holds a sequence of grammar symbols, with $ at the bottom of the stack.

• Input buffer – contains the input to be parsed, with $ as an end marker for the string.

• Parsing table.

Process

• Initially the stack contains $ to indicate the bottom of the stack, and the start symbol of the grammar on top of $.

• The input string is placed in the input buffer with $ at the end to indicate the end of the string.

• The parsing algorithm refers to the grammar symbol on top of the stack and the input symbol pointed to by the pointer, and consults the entry M[A, a], where A is the non-terminal on top of the stack and a is the symbol read by the pointer.

• Based on the table entry, if a production is found, then the body of the production is pushed onto the stack in reverse order, with its leftmost symbol on top of the stack.

• The process repeats until the entire string is processed.

• When the stack contains $ (the bottom end marker) and the pointer reads $ (the end of the input string), parsing has succeeded.

• If no entry is found, an error is reported, stating that the input string cannot be parsed by the grammar.

Algorithm Table-driven predictive parsing

Input   A string w and parsing table M for a grammar G

Output If w is in L(G) then success; otherwise error

Method

 

let a be the first symbol of w;

let X be the top of stack symbol;

while (X ≠ $) { /* stack is not empty */

    if (X = a) { pop the stack and let a be the next symbol of w; }

    else if (X is a terminal) error();

    else if (M[X, a] is an error entry) error();

    else if (M[X, a] = X → Y1 Y2 ... Yk) {

        output the production X → Y1 Y2 ... Yk;

        pop the stack;

        push Yk, Yk−1, ..., Y1 onto the stack, with Y1 on top;

    }

    let X be the top of stack symbol;

}
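The algorithm above can be realized in a few lines of C. The sketch below hardcodes the LL(1) table for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, writing E' and T' as e and t and id as i; the symbol encoding and table layout are illustrative assumptions for this sketch.

#include <stdio.h>
#include <string.h>

static int term_index(char c) {          /* column of a terminal in M */
    switch (c) {
    case 'i': return 0; case '+': return 1; case '*': return 2;
    case '(': return 3; case ')': return 4; case '$': return 5;
    }
    return -1;
}
static int nonterm_index(char c) {       /* row of a non-terminal in M */
    switch (c) {
    case 'E': return 0; case 'e': return 1; case 'T': return 2;
    case 't': return 3; case 'F': return 4;
    }
    return -1;
}

/* Parsing table M[non-terminal][terminal]; "" is epsilon, NULL is error. */
static const char *M[5][6] = {
    /*        i      +      *      (      )     $   */
    /* E */ { "Te",  NULL,  NULL,  "Te",  NULL, NULL },
    /* e */ { NULL,  "+Te", NULL,  NULL,  "",   ""   },
    /* T */ { "Ft",  NULL,  NULL,  "Ft",  NULL, NULL },
    /* t */ { NULL,  "",    "*Ft", NULL,  "",   ""   },
    /* F */ { "i",   NULL,  NULL,  "(E)", NULL, NULL },
};

int main(void) {
    const char *w = "i+i*i$";            /* input with end marker */
    char stack[100] = "$E";              /* $ at bottom, start symbol on top */
    int top = 1, ip = 0;

    while (stack[top] != '$') {
        char X = stack[top], a = w[ip];
        if (X == a) { top--; ip++; }                      /* match terminal */
        else if (nonterm_index(X) < 0) { puts("error"); return 1; }
        else {
            const char *body = M[nonterm_index(X)][term_index(a)];
            if (!body) { puts("error"); return 1; }
            printf("%c -> %s\n", X, *body ? body : "epsilon");
            top--;                                        /* pop X */
            for (int k = (int)strlen(body) - 1; k >= 0; k--)
                stack[++top] = body[k];                   /* push body reversed */
        }
    }
    puts(w[ip] == '$' ? "accept" : "error");
    return 0;
}

On the input i+i*i$ this prints the leftmost derivation one production per line and then accepts.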

Non-recursive Predictive Parser

A non-recursive predictive parser uses an explicit stack data structure.

     This avoids implicit recursive calls.

     It can also be termed a table-driven predictive parser.

 

Components

• Input buffer – holds input string to be parsed.

• Stack – holds sequence of grammar symbols.

• Predictive parsing algorithm – contains steps to parse the input string; controls the parser’s process.

• Parsing table – contains entries based on which parsing action has to be carried out.

          

Process

• Initially, the stack contains $ at the bottom, with the start symbol of the grammar on top.

• The input string to be parsed is placed in the input buffer with $ as the end marker.

• If X is the non-terminal on top of the stack and the input symbol being read is a, the parser chooses a production by consulting the entry M[X, a] of the parsing table.

• Replace the non-terminal on top of the stack with the body of the production found in M[X, a], pushed in reverse order so that the leftmost symbol of the right side ends up on top of the stack.

• Compare the top of stack symbol with input symbol.

• If it matches, pop the symbol from stack and advance the pointer reading the input buffer.

• If no match is found repeat from step 2.

• Stop parsing when the stack is empty (holds $) and input buffer reads end marker ($).

Error Recovery in Predictive Parsing

• Recovery in a non-recursive predictive parser is easier than in a recursive descent parser.

• Panic mode recovery

   o If a terminal is on top of the stack, pop the terminal.

   o If a non-terminal is on top of the stack, discard input symbols until one appears on which the non-terminal can be expanded (for example, a symbol in its FOLLOW set, on which the non-terminal is popped).

• Phrase level recovery

   o Carefully fill in the blank (error) entries of the parsing table with pointers to routines that decide what to do.

BOTTOM-UP PARSING

• Bottom-up parsers construct parse trees starting from the leaves and work up to the root.

• Bottom-up syntax analysis is also termed as shift-reduce parsing.

• The common method of shift-reduce parsing is called LR parsing.

• Operator precedence parsing is an easy-to-implement shift-reduce parser.

• Shift-reduce parsing tries to build a parse tree for an input string beginning at the leaves (the bottom) and working up towards the root (the top).

• At each step of reduction, a substring matching the right side of a production is replaced by the symbol on the left side of that production.

• If the substring is chosen correctly at each step, a rightmost derivation is traced out in reverse.

          

Handles

A handle of a string is a substring that matches the right side of a production and whose reduction to the non-terminal on the left side of the production represents one step along the reverse of a rightmost derivation.

 

Precise definition of a handle

 

• A handle of a right-sentential form ɣ is a production A → β and a position of ɣ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of ɣ.

  

• The string w to the right of the handle contains only terminal symbols.

(eg.) Consider the grammar

S → aABe

A → Abc | b

B → d

 

The sentence abbcde can be reduced to S by the following steps.

abbcde

aAbcde

aAde

aABe

S

These reductions trace out the following rightmost derivation in reverse.

     

Handle Pruning

 

• If A → β is a production, then reducing β to A by this production is called handle pruning, i.e., removing the children of A from the parse tree.

• A rightmost derivation in reverse can be obtained by handle pruning.

Start with a string of terminals ω that is to be parsed. If ω is a sentence of the grammar at hand, then ω = ɣn, where ɣn is the nth right-sentential form of some as yet unknown rightmost derivation.

     

Example of right-sentential forms and handles for the grammar

                                 E → E + E

                                 E → E * E

                                 E → ( E )

                                 E → id

Right sentential form      Handle      Reducing production
id1 + id2 * id3            id1         E → id
E + id2 * id3              id2         E → id
E + E * id3                id3         E → id
E + E * E                  E * E       E → E * E
E + E                      E + E       E → E + E
E

Shift-reduce Parsing

 

i) Shift-reduce parsing is a bottom-up parsing technique that reduces a string w to the start symbol of the grammar.

ii) It scans and parses the input text in one forward pass without backtracking.

Stack implementation of shift-reduce parsing

 

• Handle pruning must solve the following two problems to perform parsing:

   o Locating the substring to be reduced in a right-sentential form.

   o Determining which production to choose when more than one production has that substring on its right side.

• A stack is the type of data structure to use in a shift-reduce parser.

Implementation of shift-reduce parser

 

Shift-reduce parser can be implemented by using the following components:

• Stack is used to hold grammar symbols.

• An input buffer is used to hold the string w to be parsed.

• $ is used to mark the bottom of the stack and also the right end of the input.

• Initially the stack is empty and the string ω is on the input, as follows:

                                         Stack       Input

                                         $           ω $

• The parser processes by shifting zero or more input symbols onto the stack until a handle β is on top of the stack.

• The parser then reduces β to the left side of the appropriate production.

• The parser repeats this cycle until it has detected an error or until the stack contains the start symbol and the input is empty.

                                         Stack       Input

                                         $ S         $

• When the input buffer reaches the end marker symbol $ and the stack contains the start symbol, the parser halts and announces successful completion of parsing.

Actions in shift-reduce parser

 

A shift-reduce parser can make four possible actions viz: 1) shift 2) reduce 3) accept 4) error.

• A shift action, shifts the next symbol onto the top of the stack.

• A reduce action replaces the handle (a string matching the right side of a production) on top of the stack by the symbol on the left side of the production concerned.

To perform reduction, the parser must know the right end of the handle which is at the top of the stack. Then the left end of the handle within the stack is located and the non-terminal to replace the handle is decided.

• An accept action, initiates the parser to announce successful completion of parsing.

• An error action, discovers that a syntax error has occurred and calls an error recovery routine.

Note:

An important fact that justifies the use of a stack in shift-reduce parsing is that the handle will always appear on top of the stack and never inside.

(eg.) Consider the grammar

                                 E –> E+E

                                 E –> E*E

                                                 E –> (E)

                                 E –> id

and the input string id1+ id2 * id3. Use the shift-reduce parser to check whether the input string is accepted by the above grammar.

Stack            Input                Action
$                id1 + id2 * id3 $    shift
$ id1            + id2 * id3 $        reduce by E → id
$ E              + id2 * id3 $        shift
$ E +            id2 * id3 $          shift
$ E + id2        * id3 $              reduce by E → id
$ E + E          * id3 $              shift
$ E + E *        id3 $                shift
$ E + E * id3    $                    reduce by E → id
$ E + E * E      $                    reduce by E → E * E
$ E + E          $                    reduce by E → E + E
$ E              $                    accept

Viable prefixes

The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes.

Conflicts during shift-reduce parsing

• Shift-reduce parsing cannot be used for every context-free grammar.

• For such a grammar, a shift-reduce parser can reach a configuration in which it cannot decide whether to shift or to reduce (a shift-reduce conflict), or cannot decide which of several reductions to make (a reduce-reduce conflict), even knowing the entire stack contents and the next input symbol.

(eg.)

• An ambiguous grammar can never be LR. Consider the dangling-else grammar,

stmt → if expr then stmt

     | if expr then stmt else stmt

     | other

• In this grammar a shift-reduce conflict occurs for some input strings.

• So this grammar is not an LR(1) grammar.

What is Parse Tree? – Compiler Design

By Dinesh Thakur

• A parse tree is a hierarchical structure that represents the derivation of the grammar to yield an input string.

• Root node of parse tree has the start symbol of the given grammar from where the derivation proceeds.

• Leaves of parse tree represent terminals.

• Each interior node represents productions of grammar.

• If A -> xyz is a production, then the parse tree will have A as interior node whose children are x, y and z from its left to right.

                       

Construct parse tree for E → E + E | E * E | id

              

Construct parse tree for S → SS* | SS+ | a

             

Yield of Parse Tree

Leaf nodes of parse tree are concatenated from left to right to form the input string derived from a grammar which is called yield of parse tree.

Figure represents the parse tree for the string id + id * id.
The string id + id * id is the yield of the parse tree depicted in the figure.

Figure represents the parse tree for the string aa+a*.
The string aa+a* is the yield of the parse tree depicted in the figure.

What is Derivations? – Compiler Design

By Dinesh Thakur

Derivation is used to determine whether a string belongs to the language of a given grammar.


Types

• Leftmost derivation.

• Rightmost derivation.

Leftmost Derivation

In leftmost derivation, at each and every step the leftmost non-terminal is expanded by substituting its corresponding production to derive a string.

Example

             Leftmost Derivation

Rightmost Derivation

In rightmost derivation, at each and every step the rightmost non-terminal is expanded by substituting its corresponding production to derive a string.

Example

               Rightmost Derivation

What is Context Free Grammars? Compiler Design

By Dinesh Thakur

Grammars are used to describe the syntax of a programming language. They specify the structure of expressions and statements.

stmt -> if (expr) then stmt

where stmt denotes statements,

expr denotes expressions.

 

Types of grammar

 

• Type 0 grammar

• Type 1 grammar

• Type 2 grammar

• Type 3 grammar

 

Context Free Grammar

 

Context free grammar is also called as Type 2 grammar.

 

Definition

 

A context free grammar G is defined by the four-tuple

                         G = (V, T, P, S)

where,

G – Grammar

V – Set of variables

T – Set of Terminals

P – Set of productions

S – Start symbol

It produces a Context Free Language (CFL), which is defined as

                   L(G) = { w | w ∈ T*, S ⇒* w }

where,

L-Language

G- Grammar

w – Input string

S – Start symbol

T – Terminal

Hence, CFL is a collection of input strings which are terminals, derived from the start symbol of grammar on multiple steps.

 

Conventions

 

Terminals are symbols from which strings are formed.

• Lowercase letters, e.g., a, b, c.

• Operators, e.g., +, −, *.

• Punctuation symbols, e.g., comma, parentheses.

• Digits, e.g., 0, 1, 2, ..., 9.

• Boldface names, e.g., id, if.

 

Non-terminals are syntactic variables that denote a set of strings.

Uppercase letters, e.g., A, B, C.

Lowercase italic names, e.g., expr, stmt.

Start symbol is the head of the production stated first in the grammar.

Production is of the form LHS ->RHS (or) head -> body, where head contains only one non-terminal and body contains a collection of terminals and non-terminals.

(eg.) Let G be,

        

Context Free Grammars vs Regular Expressions

Grammars are more powerful than regular expressions.

Every construct that can be described by a regular expression can be described by a grammar but not vice-versa.

Every regular language is a context free language but reverse does not hold.

(eg.)

RE = (a | b)*abb (the set of strings ending with abb).

 Grammar

             

Rules

 

For each state i of the NFA, create a non-terminal Ai.

If state i has a transition to state j on input a, add the production Ai → aAj.

If state i goes to state j on input ε, add the production Ai → Aj.

If i is an accepting state, add Ai → ε.

If i is the start state, make Ai the start symbol of the grammar.
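(eg.) For an NFA accepting (a | b)*abb, with start state 0 looping to itself on a and b and the path 0 → 1 → 2 → 3 spelling abb (state 3 accepting), the rules yield the grammar

A0 → a A0 | b A0 | a A1

A1 → b A2

A2 → b A3

A3 → ε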

What is Parser (Syntax analysis)? Error Handling and Recovery in Syntax Analyzer

By Dinesh Thakur

Syntax analysis is the second phase of compiler.

Syntax analysis is also known as parsing.

Parsing is the process of determining whether a string of tokens can be generated by a grammar.

It is performed by syntax analyzer which can also be termed as parser.

In addition to construction of the parse tree, syntax analysis also checks and reports syntax errors accurately.

(eg.)

                   C = a + b * 5

Syntax tree can be given as,

         Syntax analysis (parser)

 

Parser is a program that obtains tokens from lexical analyzer and constructs the parse tree which is passed to the next phase of compiler for further processing.

Parser implements context free grammar for performing error checks.

Types of Parser

• Top down parsers Top down parsers construct parse tree from root to leaves.

• Bottom up parsers Bottom up parsers construct parse tree from leaves to root.

Role of Parser

Figure depicts the role of parser with respect to other phases.

• Once a token is generated by the lexical analyzer, it is passed to the parser.

• On receiving a token, the parser verifies that the string of token names can be generated by the grammar of the source language.

• It calls the function getNextToken(), to notify the lexical analyzer to yield another token.

• It scans the token one at a time from left to right to construct the parse tree.

• It also checks the syntactic constructs of the grammar.

        Position of parser in compiler

Need for Parser

• Parser is needed to detect syntactic errors efficiently.

• An error is detected as soon as a prefix of the input cannot be completed to form a string in the language. This property of analyzing prefixes of the input is called the viable-prefix property.

Error Recovery Strategies

Error recovery strategies are used by the parser to recover from errors once it is detected. The simplest recovery strategy is to quit parsing with an error message for the first error itself.

Panic Mode Recovery

Once an error is found, the parser intends to find designated set of synchronizing tokens by discarding input symbols one at a time.

Synchronizing tokens are delimiters, such as a semicolon or }, whose role in the source program is clear.

• When parser finds an error in the statement, it ignores the rest of the statement by not processing the input.

• This is the easiest way of error-recovery.

• It prevents the parser from developing infinite loops.

Advantages

• Simplicity.

• Never get into infinite loop.

Disadvantage

• Additional errors cannot be checked as some of the input symbols will be skipped.

Phrase Level Recovery

Parser performs local correction on the remaining input when an error is detected.

• When a parser finds an error, it tries to take corrective measures so that the rest of inputs of statement allow the parser to parse ahead.

• One wrong correction will lead to an infinite loop.

The local correction may be

• Replacing a prefix by some string.

• Replacing comma by semicolon.

• Deleting extraneous semicolon.

• Insert missing semicolon.

Advantage

• It can correct any input string.

Disadvantage

• It is difficult to cope up with actual error if it has occurred before the point of detection.

Error Production

Productions which generate erroneous constructs are augmented to the grammar by considering common errors that occur.

These productions detect the anticipated errors during parsing.

Error diagnostics about the erroneous constructs are generated by the parser.

Global Correction

There are algorithms which make changes to modify an incorrect string into a correct string.

These algorithms perform a minimal sequence of changes to obtain a globally least-cost correction.

When a grammar G and an incorrect string p are given, these algorithms find a parse tree for a string q related to p with the smallest number of transformations.

The transformations may be insertions, deletions and change of tokens.

Advantage

• It has been used for phrase level recovery to find optimal replacement strings.

Disadvantage

• This strategy is too costly to implement in terms of time and space.

What is LEX? Use of Lex.

By Dinesh Thakur

• Lex is a tool used in the lexical analysis phase to recognize tokens by means of regular expressions.

• The lex tool itself is a lex compiler.

Use of Lex

                         Creating lexical analyzer

• lex.l is an input file written in a language that describes the lexical analyzer to be generated. The lex compiler transforms lex.l into a C program known as lex.yy.c.

• lex.yy.c is compiled by the C compiler to a file called a.out.

• The output of C compiler is the working lexical analyzer which takes stream of input characters and produces a stream of tokens.

• yylval is a global variable which is shared by lexical analyzer and parser to return the name and an attribute value of token.

• The attribute value can be numeric code, pointer to symbol table or nothing.

• Another tool for lexical analyzer generation is Flex.

Structure of Lex Programs

Lex program will be in following form

declarations

%%

translation rules

%%

auxiliary functions

Declarations This section includes declaration of variables, constants and regular definitions.

Translation rules It contains regular expressions and code segments.

Form : Pattern {Action}

Pattern is a regular expression or regular definition.

Action refers to segments of code.

Auxiliary functions This section holds additional functions which are used in actions. These functions are compiled separately and loaded with lexical analyzer.

The lexical analyzer produced by lex starts its process by reading one character at a time until a valid match for a pattern is found.

Once a match is found, the associated action takes place to produce a token.

The token is then given to the parser for further processing.
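For concreteness, here is a minimal lex specification in the three-section form described above; printing tokens stands in for returning token codes to a parser, and the pattern names are illustrative.

%{
#include <stdio.h>
%}
digit    [0-9]
letter   [a-zA-Z]
%%
{digit}+                      { printf("NUMBER %s\n", yytext); }
{letter}({letter}|{digit})*   { printf("ID %s\n", yytext); }
[ \t\n]                       ; /* strip white space */
.                             { printf("UNKNOWN %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void)   { return yylex(); }

The declarations section defines the regular definitions digit and letter, the translation rules pair each pattern with its action, and the auxiliary section supplies main() and yywrap().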

Conflict Resolution in Lex

Conflict arises when several prefixes of the input match one or more patterns. This can be resolved by the following:

• Always prefer a longer prefix than a shorter prefix.

• If two or more patterns are matched for the longest prefix, then the first pattern listed in lex program is preferred.

Lookahead Operator

• The lookahead operator is an additional operator read by lex in order to distinguish an additional pattern for a token.

• The lexical analyzer reads one character ahead of the valid lexeme and then retracts to produce the token.

• At times, certain characters are needed at the end of the input to match a pattern. In such cases, slash (/) is used to indicate the end of the part of the pattern that matches the lexeme.

(eg.) In some languages keywords are not reserved, so the statements

IF (I, J) = 5 and IF(condition) THEN

result in a conflict over whether IF is an array name or a keyword. To resolve this, the lex rule for the keyword IF can be written with a lookahead pattern, e.g.

IF / \( .* \) {letter}

where {letter} refers to the regular definition letter.

Design of Lexical Analyzer

• Lexical analyzer can either be generated by NFA or by DFA.

• DFA is preferable in the implementation of lex.

Structure of Generated Analyzer

Architecture of lexical analyzer generated by lex is given in Fig.

                Lex program used by finite automaton simulator

Lexical analyzer program includes:

 

• Program to simulate automata

• Components created from lex program by lex itself which are listed as follows:

   o A transition table for automaton.

   o Functions that are passed directly through lex to the output.

    o Actions from input program (fragments of code) which are invoked by automaton simulator when needed.

 Steps to construct automaton

Step 1: Convert each regular expression into an NFA, either by Thompson's construction or by the direct method.

Step 2: Combine all the NFAs into one by introducing a new start state with ε-transitions to each of the start states of the NFAs Ni for pattern Pi.

Step 2 is needed because the objective is to construct a single automaton that recognizes lexemes matching any of the patterns.

                 Construction of NFA from lex program

(eg.)    a    { action A1 for pattern P1 }

           abb  { action A2 for pattern P2 }

           a*b+ { action A3 for pattern P3 }

 

For the string abb, patterns P2 and P3 both match, but pattern P2 is taken into account because it is listed first in the lex program.

For a string such as aabbb···, the longest matching prefix is preferred, so pattern P3 matches.

Fig. Shows NFAs for recognizing the above mentioned three patterns.

The combined NFA for all three given patterns is shown in Fig.

                 NFA's for a, abb, a*b+

 

                 Combined NFA

Pattern Matching Based on NFAs

 

The lexical analyzer reads input from the input buffer, starting at the point marked by the pointer lexemeBegin. A forward pointer moves ahead through the input symbols, calculating the set of NFA states the simulation is in at each point. If the NFA simulation has no next state for some input symbol, then no longer prefix reaching an accepting state can exist; in that case, the decision is made on the longest prefix seen so far that matches some pattern, i.e., a lexeme. The process is repeated until one or more accepting states are reached. If there are several accepting states, then the pattern Pi that appears earliest in the list of the lex program is chosen.

e.g.

           W= aaba

 

           Processing input aaba

 

Explanation

The process starts with the ε-closure of the initial state 0. After processing all the input symbols, no state is found, as there is no transition out of state 8 on input a. Hence, we look for an accepting state by retracting to a previous state. State 2, which is an accepting state, is reached after reading the first input symbol a, so the pattern a has been matched; but at state 8 the string aab has been matched with pattern a*b+. By the lex rule, the longest matching prefix is considered. Hence, action A3, corresponding to pattern P3, is executed for the string aab.

DFAs for Lexical Analyzers

DFAs are also used to represent the output of lex. A DFA is constructed from the NFA by converting all the patterns into an equivalent DFA using the subset construction algorithm. If one or more accepting NFA states are present in a DFA state, the first pattern among them listed in the lex program is recorded as the output of that DFA state. Simulation of the DFA proceeds as for the NFA and continues until no next state is found. Then retraction takes place to find the last accepting state of the DFA, and the action associated with the pattern for that state is executed.

Implementing Lookahead Operator

Lookahead operator r1/r2 is needed because the pattern r1 for a particular token may need to describe some trailing context r2 in order to correctly identify the actual lexeme.

For the pattern r1/r2, '/' is treated as ε.

If some prefix ab is recognized by the NFA as a match for the regular expression, the lexeme does not end merely because the NFA reaches an accepting state.

The end of the lexeme occurs when the NFA enters a state p such that

• p has an ε-transition on /,

• there is a path from the start state to state p that spells out a,

• there is a path from state p to an accepting state that spells out b,

• a is as long as possible for any ab satisfying conditions 1–3.

       NFA for keyword IF

Figure shows the NFA for recognizing the keyword IF with lookahead. The transition from state 2 to state 3 represents the lookahead operator (an ε-transition).

The accepting state is state 6, which indicates the presence of the keyword IF. Hence, the lexeme IF is found by looking back to state 2 whenever the accepting state (state 6) is reached.

Convert Regular Expression to DFA – Compiler Design

By Dinesh Thakur

Regular expression is used to represent the language (lexeme) of finite automata (lexical analyzer).

Regular Expression – Compiler Design

By Dinesh Thakur

• Regular expressions are a notation to represent lexeme patterns for a token.

• They are used to represent the language for lexical analyzer.

• They assist in finding the type of token that accounts for a particular lexeme.

Input Buffering – Compiler Design

By Dinesh Thakur

• To ensure that the right lexeme is found, one or more characters may have to be looked at beyond the next lexeme.

• Hence a two-buffer scheme is introduced to handle large lookaheads safely.

• Techniques for speeding up the process of lexical analyzer such as the use of sentinels to mark the buffer end have been adopted.

There are three general approaches for the implementation of a lexical analyzer:

(i) By using a lexical-analyzer generator, such as lex compiler to produce the lexical analyzer from a regular expression based specification. In this, the generator provides routines for reading and buffering the input.

(ii) By writing the lexical analyzer in a conventional systems-programming language, using I/O facilities of that language to read the input.

(iii) By writing the lexical analyzer in assembly language and explicitly managing the reading of input.

Buffer Pairs

Because a large amount of time is consumed in moving characters, specialized buffering techniques have been developed to reduce the overhead required to process an input character.

Fig shows the buffer pairs which are used to hold the input data.

           Buffer pairs

Scheme

• Consists of two buffers, each of N-character size, which are reloaded alternately.

• N is the number of characters in one disk block, e.g., 4096.

• N characters are read from the input file into a buffer half using one system read command.

• eof is inserted at the end if the number of characters read is less than N.

Pointers

Two pointers, lexemeBegin and forward, are maintained.

lexemeBegin points to the beginning of the current lexeme, which is yet to be found.

forward scans ahead until a match for a pattern is found.

• Once a lexeme is found, forward is set to the character at its right end, and lexemeBegin is set to the character immediately after the lexeme just found.

• The current lexeme is the set of characters between the two pointers.

Disadvantages of this scheme

• This scheme works well most of the time, but the amount of lookahead is limited.

• This limited lookahead may make it impossible to recognize tokens in situations where the distance that the forward pointer must travel is more than the length of the buffer.

(eg.) DECLARE (ARG1, ARG2, ..., ARGn) in a PL/I program;

• It cannot be determined whether DECLARE is a keyword or an array name until the character that follows the right parenthesis is seen.

Sentinels

• In the previous scheme, each time the forward pointer is moved, a check must be made that we have not moved off one half of the buffer; if we have, the other half must be reloaded.

• Therefore the ends of the buffer halves require two tests for each advance of the forward pointer.

Test 1: For end of buffer.

Test 2: To determine what character is read.

• The usage of sentinel reduces the two tests to one by extending each buffer half to hold a sentinel character at the end.

• The sentinel is a special character that cannot be part of the source program. (eof character is used as sentinel).

          Sentinels at the end of each buffer
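A minimal C sketch of this scheme follows, using '\0' as the sentinel character and simply counting the characters read. Buffer size and names are illustrative, and the input is assumed not to contain '\0'.

#include <stdio.h>

#define N 4096                      /* characters per buffer half */
static char buf[2 * N + 2];         /* two halves, one sentinel slot each */
static char *forward;
static FILE *src;

/* Refill one half and plant the sentinel right after the data read. */
static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';
}

/* Advance forward: a single test per character in the common case. */
static int advance(void) {
    for (;;) {
        char c = *forward++;
        if (c != '\0') return (unsigned char)c;      /* ordinary character */
        if (forward - 1 == buf + N) {                /* end of first half  */
            load(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward - 1 == buf + 2 * N + 1) { /* end of second half */
            load(buf);
            forward = buf;
        } else {
            return EOF;     /* sentinel inside a half: true end of input */
        }
    }
}

int main(void) {
    long count = 0;
    src = stdin;
    load(buf);                      /* prime the first half */
    forward = buf;
    while (advance() != EOF) count++;
    printf("%ld characters\n", count);
    return 0;
}

Only when the sentinel is seen does the code pay for the extra tests that decide between "reload the other half" and "end of input".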

Advantages

• Most of the time, it performs only one test: whether the forward pointer points to an eof.

• Only when it reaches the end of the buffer half or eof, it performs more tests.

• Since N input characters are encountered between eofs, the average number of tests per input character is very close to 1.

Lexical Analysis – Compiler Design

By Dinesh Thakur

Lexical analysis is the process of converting a sequence of characters from source program into a sequence of tokens.

A program which performs lexical analysis is termed as a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing which are as follows:

• Scanning

• Tokenization

Token, Pattern and Lexeme

Token

A token is a valid sequence of characters, represented by a lexeme. In a programming language,

• keywords,

• constant,

• identifiers,

• numbers,

• operators and

• punctuations symbols

are possible tokens to be identified.

Pattern

Pattern describes a rule that must be matched by sequence of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme

A lexeme is a sequence of characters that matches the pattern for a token, i.e., an instance of a token.

(eg.) c = a + b * 5;

                                               Lexemes and tokens

Lexeme     Token
c          identifier
=          assignment symbol
a          identifier
+          + (addition symbol)
b          identifier
*          * (multiplication symbol)
5          5 (number)

The sequence of tokens produced by lexical analyzer helps the parser in analyzing the syntax of programming languages.

Role of Lexical Analyzer

                         Interaction between lexical analyzer and parser

Lexical analyzer performs the following tasks:

• Reads the source program, scans the input characters, groups them into lexemes and produces the tokens as output.

• Enters the identified token into the symbol table.

• Strips out white spaces and comments from source program.

• Correlates error messages with the source program i.e., displays error message with its occurrence by specifying the line number.

• Expands the macros if it is found in the source program.

Tasks of lexical analyzer can be divided into two processes:

Scanning: Performs reading of input characters, removal of white spaces and comments.

Lexical Analysis: Produce tokens as the output.

Need of Lexical Analyzer

Simplicity of design of compiler The removal of white spaces and comments leaves the syntax analyzer with clean syntactic constructs to process.

Compiler efficiency is improved Specialized buffering techniques for reading characters speed up the compilation process.

Compiler portability is enhanced Input-device-specific peculiarities can be confined to the lexical analyzer.

Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program. It has the following issues:

• Lookahead

• Ambiguities

Lookahead

Lookahead is required to decide when one token ends and the next token begins. Simple examples that raise lookahead issues are i vs. if and = vs. ==. Therefore a way to describe the lexemes of each token is required.

A way is needed to resolve such ambiguities:

• Is if two variables i and f, or the keyword if?

• Is == two assignment symbols = =, or a single equality operator ==?

• arr(5, 4) vs. fn(5, 4) in Ada (array reference syntax and function call syntax are similar).

Hence, the amount of lookahead to be considered, and a way to describe the lexemes of each token, must both be settled.

Regular expressions are one of the most popular ways of representing tokens.
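For instance, distinguishing = from == needs one character of lookahead. A minimal sketch in C; the token names and the scan_after_equals() entry point are hypothetical:

    #include <stdio.h>

    typedef enum { TOK_ASSIGN, TOK_EQ } Token;

    /* Called after a '=' has been read: peek one more character to
       decide between assignment (=) and equality (==). */
    Token scan_after_equals(FILE *in) {
        int c = fgetc(in);
        if (c == '=')
            return TOK_EQ;        /* consumed "==" */
        if (c != EOF)
            ungetc(c, in);        /* push the lookahead character back */
        return TOK_ASSIGN;        /* plain "=" */
    }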

Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. When more than one expression can match the current input, lex chooses as follows (the fragment after this list illustrates both rules):

• The longest match is preferred.

• Among rules which matched the same number of characters, the rule given first is preferred.
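A minimal lex rules fragment demonstrating both choices; the printed messages are only a sketch, not a complete specification:

    %%
    if        { printf("KEYWORD if\n");            /* earlier rule wins ties */ }
    [a-z]+    { printf("IDENTIFIER %s\n", yytext); /* longest match preferred */ }
    "=="      { printf("EQ\n"); }
    "="       { printf("ASSIGN\n"); }
    [ \t\n]   { /* skip white space */ }
    %%

On the input if, the first two rules both match two characters, so the rule given first wins and if is a keyword; on iff, the identifier rule matches three characters, the longest match, so iff is an identifier.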

Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.

• Lexical errors are uncommon, but they still must be handled by a scanner.

• Misspellings of identifiers, keywords, or operators are considered lexical errors.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.

Error Recovery Schemes

• Panic mode recovery

• Local correction

   o Source text is changed around the error point in order to get a correct text.

   o Analyzer will be restarted with the resultant new text as input.

• Global correction

   o It is an enhanced panic mode recovery.

   o Preferred when local correction fails.

Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.

(eg.) Suppose the string fi is encountered for the first time in a C program in the context:

fi (a== f(x))

A lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.

Since fi is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.
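A sketch of the deletion loop in C; the set of characters that can begin a token is a simplified assumption of this sketch:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Panic mode: discard input characters until one is found that
       can begin a well-formed token (simplified: letters, digits and
       a few operator/punctuation characters). */
    void panic_mode_recover(FILE *in) {
        int c;
        while ((c = fgetc(in)) != EOF) {
            if (isalnum(c) || strchr("+-*/=<>(){};,", c)) {
                ungetc(c, in);   /* resume normal scanning here */
                return;
            }
            /* otherwise the character is deleted from the input */
        }
    }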

Local correction

Local correction performs deletion, insertion and/or replacement of symbols around the error-detection point.

(eg.) In Pascal, for c[i] '=' ; the scanner deletes the first quote because it cannot legally follow the closing bracket, and the parser then replaces the resulting = by the assignment symbol :=.

Most of the errors are corrected by local correction.

(eg.) The effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider

· · · for $tnight · · ·

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier.

In effect, the input becomes

· · · fortnight · · ·

which will cause a syntax error. Such false errors are unavoidable, though a later syntactic error-repair pass may help.

Lexical error handling approaches

Lexical errors can be handled by the following single-edit actions (a repair sketch follows the list):

• Deleting one character from the remaining input.

• Inserting a missing character into the remaining input.

• Replacing a character by another character.

• Transposing two adjacent characters.
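A minimal sketch of a repair driver that tries deletion and transposition (insertion and replacement would enumerate candidate characters in the same fashion). Here is_valid_token() is a toy stand-in, not a real scanner interface:

    #include <stdio.h>
    #include <string.h>

    /* toy stand-in: a real scanner would test against its token patterns */
    static int is_valid_token(const char *s) {
        return !strcmp(s, "if") || !strcmp(s, "for") || !strcmp(s, "while");
    }

    /* Try to repair lex with a single edit; on success the repaired
       lexeme is left in out (which must hold strlen(lex) + 1 bytes). */
    static int repair(const char *lex, char *out) {
        size_t n = strlen(lex);
        /* 1. deleting one character */
        for (size_t i = 0; i < n; i++) {
            memcpy(out, lex, i);
            strcpy(out + i, lex + i + 1);
            if (is_valid_token(out)) return 1;
        }
        /* 2. transposing two adjacent characters */
        for (size_t i = 0; i + 1 < n; i++) {
            strcpy(out, lex);
            out[i] = lex[i + 1];
            out[i + 1] = lex[i];
            if (is_valid_token(out)) return 1;
        }
        return 0;   /* no single deletion/transposition helps */
    }

    int main(void) {
        char out[16];
        if (repair("fi", out))          /* transposition yields "if" */
            printf("repaired to %s\n", out);
        return 0;
    }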

Compiler Construction tools – Compiler Design

By Dinesh Thakur

Some commonly used compiler-construction tools include:

1. Parser generators.
2. Scanner generators.
3. Syntax-directed translation engines.
4. Automatic code generators.
5. Data-flow analysis engines.
6. Compiler-construction toolkits.

Parser Generators

Input: Grammatical description of a programming language
Output: Syntax analyzers.

Parser generator takes the grammatical description of a programming language and produces a syntax analyzer.
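For example, yacc is a well-known parser generator; a fragment of a grammar specification might look like the sketch below (the embedded action code is C, and the actions shown are only illustrative):

    %token NUMBER
    %left '+'
    %left '*'
    %%
    expr : expr '+' expr   { $$ = $1 + $3; }
         | expr '*' expr   { $$ = $1 * $3; }
         | '(' expr ')'    { $$ = $2; }
         | NUMBER          { $$ = $1; }
         ;
    %%

From such a specification, yacc produces a C parsing routine (yyparse) that recognizes the language and runs the embedded actions.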

Scanner Generators

Input: Regular expression description of the tokens of a language
Output: Lexical analyzers.
A scanner generator (lex is a well-known example) generates lexical analyzers from a regular-expression description of the tokens of a language.

Syntax-directed Translation Engines

Input: Parse tree.
Output: Intermediate code.
Syntax-directed translation engines produce collections of routines that walk a parse tree and generate intermediate code.
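A minimal sketch in C of such a tree walk, emitting three-address code for binary expressions; the node layout and temporary naming are assumptions of this sketch:

    #include <stdio.h>

    typedef struct Node {
        char op;                 /* '+', '*', or 0 for a leaf */
        int  value;              /* leaf value when op == 0 */
        struct Node *left, *right;
    } Node;

    static int temp_count = 0;

    /* walk the tree bottom-up; returns the number of the
       temporary that holds the subtree's result */
    int gen(const Node *n) {
        if (n->op == 0) {
            int t = ++temp_count;
            printf("t%d = %d\n", t, n->value);
            return t;
        }
        int l = gen(n->left), r = gen(n->right);
        int t = ++temp_count;
        printf("t%d = t%d %c t%d\n", t, l, n->op, r);
        return t;
    }

    int main(void) {
        /* tree for 4 * 5: each interior node yields one instruction */
        Node four = {0, 4, NULL, NULL}, five = {0, 5, NULL, NULL};
        Node mul  = {'*', 0, &four, &five};
        gen(&mul);   /* prints t1 = 4, t2 = 5, t3 = t1 * t2 */
        return 0;
    }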

Automatic Code Generators

Input: Intermediate language.
Output: Machine language.
An automatic code generator takes a collection of rules that define the translation of each operation of the intermediate language into the machine language for a target machine.
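As an illustration, the rule for an addition might expand the intermediate-language statement x = y + z into a load/add/store sequence. A minimal C sketch of such a rule; the mnemonics LD/ADD/ST and register R0 are illustrative assumptions, not a real target instruction set:

    #include <stdio.h>

    /* emit target instructions for the intermediate operation x = y + z */
    void emit_add(const char *x, const char *y, const char *z) {
        printf("LD  R0, %s\n", y);      /* load y into a register  */
        printf("ADD R0, R0, %s\n", z);  /* add z                   */
        printf("ST  %s, R0\n", x);      /* store the result into x */
    }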

Data-flow Analysis Engines

A data-flow analysis engine gathers information about how values are transmitted from one part of a program to each of the other parts. Data-flow analysis is a key part of code optimization.

Compiler Construction Toolkits

Compiler-construction toolkits provide an integrated set of routines for constructing the various phases of a compiler.

Grouping of Phases – Compiler Design

By Dinesh Thakur

The phases of a compiler can be grouped as:

Front end

Front end of a compiler consists of the phases

• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.

Back end

Back end of a compiler contains

• Code optimization.
• Code generation.

Front End

• The front end comprises the phases that depend on the source language and are independent of the target machine.
• It includes lexical and syntactic analysis, symbol-table management, semantic analysis and the generation of intermediate code.
• Some code optimization can also be done by the front end.
• It also includes error handling for the phases concerned.

            Front end

Back End

• The back end comprises the phases that depend on the target machine and are independent of the source language.
• It includes code optimization and code generation.
• In addition, it encompasses error handling and symbol-table operations.

            Back end

Passes

• Several phases of a compiler can be implemented in a single pass, where a pass consists of reading an input file and writing an output file.
• Phases are grouped into one pass in such a way that their activities are interleaved during the pass.
• (eg.) Lexical analysis, syntax analysis, semantic analysis and intermediate code generation might be grouped into one pass. If so, the token stream after lexical analysis may be translated directly into intermediate code.

Reducing the Number of Passes

• Minimizing the number of passes improves time efficiency, as reading from and writing to intermediate files is reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to ensure proper information flow, because one phase may need information in a different order than the previous phase produces it.
• The internal representation of the source or target program differs from its external form, so the memory needed for the internal form may be larger than that for the input and output.

Phases of Compiler – Compiler Design

By Dinesh Thakur

The structure of a compiler consists of two parts: the analysis part (front end) and the synthesis part (back end).

Compiler Design – Language Processing System

By Dinesh Thakur

                     Language processing system

Pre-processor

A source program may be divided into modules stored in separate files. The task of collecting the source program is entrusted to a separate program called the pre-processor. It may also expand macros into source-language statements.

Compiler

Compiler is a program that takes source program as input and produces assembly language program as output.

Assembler

Assembler is a program that converts assembly language program into machine language program. It produces re-locatable machine code as its output.

Loader and link-editor

• The re-locatable machine code has to be linked together with other re-locatable object files and library files into the code that actually runs on the machine.
• The linker resolves external memory addresses, where the code in one file may refer to a location in another file.
• The loader then places all the executable object files into memory for execution.

What is a Translator? Different types of translators

By Dinesh Thakur

A program written in a high-level language is called source code. To convert the source code into machine code, translators are needed.

A translator takes a program written in source language as input and converts it into a program in target language as output.

It also detects and reports the error during translation.

Roles of translator are:

• Translating the high-level language program input into an equivalent machine language program.

• Providing diagnostic messages wherever the programmer violates specification of the high-level language program.

Different types of translators

The different types of translators are as follows:

Compiler

Compiler is a translator which is used to convert programs in high-level language to low-level language. It translates the entire program and also reports the errors in source program encountered during the translation.

                Compiler

Interpreter

Interpreter is a translator which is used to convert programs in high-level language to low-level language. The interpreter translates line by line and reports an error as soon as it is encountered during the translation process.

It directly executes the operations specified in the source program when the input is given by the user.

It gives better error diagnostics than a compiler.

                Interpreter

                                 Differences between compiler and interpreter

    Sl. No   Compiler                                      Interpreter
    ------   --------                                      -----------
    1        Performs the translation of a program         Performs statement-by-statement
             as a whole.                                   translation.
    2        Execution is faster.                          Execution is slower.
    3        Requires more memory, as linking is           Memory usage is efficient, as no
             needed for the generated intermediate         intermediate object code is
             object code.                                  generated.
    4        Debugging is hard, as error messages          Stops translation when the first
             are generated only after scanning the         error is met; hence debugging is
             entire program.                               easy.
    5        Programming languages like C and C++          Programming languages like Python,
             use compilers.                                BASIC and Ruby use interpreters.

Assembler

Assembler is a translator which is used to translate the assembly language code into machine language code.

              Assembler
