• Skip to main content
  • Skip to primary sidebar
  • Skip to secondary sidebar
  • Skip to footer

Computer Notes

Library
    • Computer Fundamental
    • Computer Memory
    • DBMS Tutorial
    • Operating System
    • Computer Networking
    • C Programming
    • C++ Programming
    • Java Programming
    • C# Programming
    • SQL Tutorial
    • Management Tutorial
    • Computer Graphics
    • Compiler Design
    • Style Sheet
    • JavaScript Tutorial
    • Html Tutorial
    • Wordpress Tutorial
    • Python Tutorial
    • PHP Tutorial
    • JSP Tutorial
    • AngularJS Tutorial
    • Data Structures
    • E Commerce Tutorial
    • Visual Basic
    • Structs2 Tutorial
    • Digital Electronics
    • Internet Terms
    • Servlet Tutorial
    • Software Engineering
    • Interviews Questions
    • Basic Terms
    • Troubleshooting
Menu

Header Right

Home » Compiler » Lexical Analysis – Compiler Design
Next →
← Prev

Lexical Analysis – Compiler Design

By Dinesh Thakur

Lexical analysis is the process of converting a sequence of characters from source program into a sequence of tokens.

A program which performs lexical analysis is termed as a lexical analyzer (lexer), tokenizer or scanner.

Lexical analysis consists of two stages of processing which are as follows:

• Scanning

• Tokenization

We’ll be covering the following topics in this tutorial:

  • Token, Pattern and Lexeme
  • Role of Lexical Analyzer
  • Need of Lexical Analyzer
  • Issues in Lexical Analysis
  • Lexical Errors

Token, Pattern and Lexeme

Token

Token is a valid sequence of characters which are given by lexeme. In a programming language,

• keywords,

• constant,

• identifiers,

• numbers,

• operators and

• punctuations symbols

are possible tokens to be identified.

Pattern

Pattern describes a rule that must be matched by sequence of characters (lexemes) to form a token. It can be defined by regular expressions or grammar rules.

Lexeme

Lexeme is a sequence of characters that matches the pattern for a token i.e., instance of a

token.

(eg.) c=a+b*5;

                                               Lexemes and tokens

 

Lexemes

Tokens

c

identifier

=

assignment symbol

a

identifier

+

+ (addition symbol)

b

identifier

*

* (multiplication symbol)

5

5 (number)

 

 

 

 

 

 

 

 

 

 

 

 

 

The sequence of tokens produced by lexical analyzer helps the parser in analyzing the syntax of programming languages.

Role of Lexical Analyzer

                         Interaction between lexical analyzer and parser

Lexical analyzer performs the following tasks:

• Reads the source program, scans the input characters, group them into lexemes and produce the token as output.

• Enters the identified token into the symbol table.

• Strips out white spaces and comments from source program.

• Correlates error messages with the source program i.e., displays error message with its occurrence by specifying the line number.

• Expands the macros if it is found in the source program.

Tasks of lexical analyzer can be divided into two processes:

Scanning: Performs reading of input characters, removal of white spaces and comments.

Lexical Analysis: Produce tokens as the output.

Need of Lexical Analyzer

Simplicity of design of compiler The removal of white spaces and comments enables the syntax analyzer for efficient syntactic constructs.

Compiler efficiency is improved Specialized buffering techniques for reading characters speed up the compiler process.

Compiler portability is enhanced

Issues in Lexical Analysis

Lexical analysis is the process of producing tokens from the source program. It has the following issues:

• Lookahead

• Ambiguities

Lookahead

Lookahead is required to decide when one token will end and the next token will begin. The simple example which has lookahead issues are i vs. if, = vs. ==. Therefore a way to describe the lexemes of each token is required.

A way needed to resolve ambiguities

• Is if it is two variables i and f or if?

• Is == is two equal signs =, = or ==?

• arr(5, 4) vs. fn(5, 4) II in Ada (as array reference syntax and function call syntax are similar.

Hence, the number of lookahead to be considered and a way to describe the lexemes of each token is also needed.

Regular expressions are one of the most popular ways of representing tokens.

Ambiguities

The lexical analysis programs written with lex accept ambiguous specifications and choose the longest match possible at each input point. Lex can handle ambiguous specifications. When more than one expression can match the current input, lex chooses as follows:

• The longest match is preferred.

• Among rules which matched the same number of characters, the rule given first is preferred.

Lexical Errors

• A character sequence that cannot be scanned into any valid token is a lexical error.

• Lexical errors are uncommon, but they still must be handled by a scanner.

• Misspelling of identifiers, keyword, or operators are considered as lexical errors.

Usually, a lexical error is caused by the appearance of some illegal character, mostly at the beginning of a token.

Error Recovery Schemes

• Panic mode recovery

• Local correction

   o Source text is changed around the error point in order to get a correct text.

   o Analyzer will be restarted with the resultant new text as input.

• Global correction

   o It is an enhanced panic mode recovery.

   o Preferred when local correction fails.

Panic mode recovery

In panic mode recovery, unmatched patterns are deleted from the remaining input, until the lexical analyzer can find a well-formed token at the beginning of what input is left.

(eg.) For instance the string fi is encountered for the first time in a C program in the context:

fi (a== f(x))

A lexical analyzer cannot tell whether f iis a misspelling of the keyword if or an undeclared function identifier.

Since f i is a valid lexeme for the token id, the lexical analyzer will return the token id to the parser.

Local correction

Local correction performs deletion/insertion and/or replacement of any number of symbols in the error detection point.

(eg.) In Pascal, c[i] ‘=’; the scanner deletes the first quote because it cannot legally follow the closing bracket and the parser replaces the resulting’=’ by an assignment statement.

Most of the errors are corrected by local correction.

(eg.) The effects of lexical error recovery might well create a later syntax error, handled by the parser. Consider

· · · for $tnight · · ·

The $ terminates scanning of for. Since no valid token begins with $, it is deleted. Then tnight is scanned as an identifier.

In effect it results,

· · · fortnight · · ·

Which will cause a syntax error? Such false errors are unavoidable, though a syntactic error-repair may help.

Lexical error handling approaches

Lexical errors can be handled by the following actions:

• Deleting one character from the remaining input.

• Inserting a missing character into the remaining input.

• Replacing a character by another character.

• Transposing two adjacent characters.

You’ll also like:

  1. Compiler Construction tools – Compiler Design
  2. Phases of Compiler – Compiler Design
  3. What is Derivations? – Compiler Design
  4. LR Parsers – Compiler Design
  5. Input Buffering – Compiler Design
Next →
← Prev
Like/Subscribe us for latest updates     

About Dinesh Thakur
Dinesh ThakurDinesh Thakur holds an B.C.A, MCDBA, MCSD certifications. Dinesh authors the hugely popular Computer Notes blog. Where he writes how-to guides around Computer fundamental , computer software, Computer programming, and web apps.

Dinesh Thakur is a Freelance Writer who helps different clients from all over the globe. Dinesh has written over 500+ blogs, 30+ eBooks, and 10000+ Posts for all types of clients.


For any type of query or something that you think is missing, please feel free to Contact us.


Primary Sidebar

Compiler Design

Compiler Design

  • CD - Translators
  • CD - Compiler Phases
  • CD - Input Buffering
  • CD - Lexical Analysis
  • CD - Regular Expression
  • CD - LR Parsers
  • CD - Parse Tree
  • CD - Parsing Type
  • CD - Lex
  • CD - Syntax Directed Definition
  • CD - Compiler Construction tools
  • CD - Regular Expression
  • CD - Context Free Grammars
  • CD - Syntax Analysis
  • CD - Grouping of Phases
  • CD - Language Processing System
  • CD - Derivations
  • CD - Error Handling

Other Links

  • Compiler Design - PDF Version

Footer

Basic Course

  • Computer Fundamental
  • Computer Networking
  • Operating System
  • Database System
  • Computer Graphics
  • Management System
  • Software Engineering
  • Digital Electronics
  • Electronic Commerce
  • Compiler Design
  • Troubleshooting

Programming

  • Java Programming
  • Structured Query (SQL)
  • C Programming
  • C++ Programming
  • Visual Basic
  • Data Structures
  • Struts 2
  • Java Servlet
  • C# Programming
  • Basic Terms
  • Interviews

World Wide Web

  • Internet
  • Java Script
  • HTML Language
  • Cascading Style Sheet
  • Java Server Pages
  • Wordpress
  • PHP
  • Python Tutorial
  • AngularJS
  • Troubleshooting

 About Us |  Contact Us |  FAQ

Dinesh Thakur is a Technology Columinist and founder of Computer Notes.

Copyright © 2023. All Rights Reserved.