Umer Farooq

Reputation: 7486

Lexical analysis

I am learning about lexers in Python and am using the PLY library for lexical analysis of some strings. I have implemented the following lexer for a small subset of C++ syntax.

However, I am seeing strange behavior. When I define the functions for the comment state after the other rule functions, the code works fine. If I define them before the other rule functions, I get errors as soon as the // section of the input string begins.

What is the reason for this?

import ply.lex as lex

tokens = (
        'DLANGLE',       # <<
        'DRANGLE',       # >>
        'EQUAL',        # =
        'STRING',       # "144"
        'WORD',         # 'Welcome' in "Welcome."
        'SEMICOLON',    # ;

)

t_ignore                = ' \t\v\r' # shortcut for whitespace


states = (
        ('cppcomment', 'exclusive'),   # // ... to end of line
)



def t_cppcomment(t): # definition here causes errors
    r'//'
    print 'MyCOm:',t.value

    t.lexer.begin('cppcomment');



def t_cppcomment_end(t):
    r'\n'
    t.lexer.begin('INITIAL');


def t_cppcomment_error(t):
    print "Error FOUND"
    t.lexer.skip(1)

def t_DLANGLE(t):

    r'<<'
    print 'MyLAN:',t.value
    return t

def t_DRANGLE(t):
    r'>>'
    return t

def t_SEMICOLON(t):

    r';'
    print 'MySemi:',t.value
    return t;

def t_EQUAL(t):
        r'='
        return t

def t_STRING(t):
        r'"[^"]*"'
        t.value = t.value[1:-1] # drop "surrounding quotes"
        print 'MyString:',t.value
        return t

def t_WORD(t):
        r'[^ <>\n]+'
        print 'MyWord:',t.value
        return t




webpage = "cout<<\"Hello World\"; // this comment"
htmllexer = lex.lex()
htmllexer.input(webpage)
while True:
        tok = htmllexer.token()
        if not tok: break
        print tok

Regards

Upvotes: 0

Views: 848

Answers (2)

Umer Farooq

Reputation: 7486

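Just figured it out. Because I declared the comment state as exclusive, the lexer does not use any of the INITIAL state rules while it is inside that state. This only shows up when the comment functions are defined at the top; when they are defined last, the // is most likely matched by an earlier rule such as WORD, so the comment state is never entered. So you have to redefine every rule you need for the comment state. This is also why PLY lets you define a per-state error function, such as t_cppcomment_error, to skip characters for which no rule is defined.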
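As far as I understand PLY, rule functions are tried in the order in which they are defined, so the order dependence can be imitated with nothing but Python's re module. The sketch below is not PLY itself; first_match and the two rule tuples are hypothetical names used only for illustration. It checks which rule wins at the start of the comment when the two rules are listed in either order.

```python
import re

# Hypothetical stand-ins for two INITIAL-state rules from the question.
COMMENT = ("COMMENT_START", r"//")
WORD = ("WORD", r"[^ <>\n]+")


def first_match(rules, text):
    """Return (rule name, matched text) for the first rule matching at the start of text."""
    # Join the rules into one alternation of named groups. Python's re tries
    # the alternatives left to right and stops at the first one that matches.
    master = re.compile(
        "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in rules)
    )
    m = master.match(text)
    return (m.lastgroup, m.group()) if m else None


text = "// this comment"
print(first_match([COMMENT, WORD], text))  # comment rule listed first
print(first_match([WORD, COMMENT], text))  # WORD listed first
```

With the comment rule first, // starts the comment. With WORD first, the same two characters come back as an ordinary WORD, which would explain why the lexer only appears to work in that ordering.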

Upvotes: 1

Joran Beasley

Reputation: 113940

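It's because you have no rules in the comment state that accept the words this or comment, and you don't really care what's in the comment anyway. You can easily do something like

t_cppcomment_ignore_ANYTHING = r'[^\r\n]+'

just below your t_ignore rule. The ignore_ prefix tells PLY to discard the match, so ANYTHING does not have to appear in the tokens list.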
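To see the effect of such a catch-all rule without installing anything, here is a minimal stdlib-only sketch of a two-state scanner. It only imitates the idea of an exclusive state and is not how PLY is implemented; tokenize, INITIAL_RULES, ONLY_NEWLINE and WITH_SKIP are made-up names for this example.

```python
import re

# Each rule: (token name or None to discard, pattern, state to switch to or None).
INITIAL_RULES = [
    (None, r"//", "cppcomment"),   # enter the comment state
    (None, r"[ \t\v\r]+", None),   # whitespace is skipped only in this state
    ("DLANGLE", r"<<", None),
    ("SEMICOLON", r";", None),
    ("STRING", r'"[^"]*"', None),
    ("WORD", r"[^ <>\n]+", None),
]

# Comment state as in the question: only the newline rule exists.
ONLY_NEWLINE = [(None, r"\n", "INITIAL")]

# Comment state with an extra rule that discards everything up to the newline.
WITH_SKIP = ONLY_NEWLINE + [(None, r"[^\r\n]+", None)]


def tokenize(text, comment_rules):
    """Scan text; return (tokens, number of characters skipped as errors)."""
    rules = {"INITIAL": INITIAL_RULES, "cppcomment": comment_rules}
    state, pos, tokens, errors = "INITIAL", 0, [], 0
    while pos < len(text):
        # Only the rules of the current state are tried (exclusive behavior).
        for name, pattern, next_state in rules[state]:
            m = re.compile(pattern).match(text, pos)
            if m:
                if name is not None:
                    tokens.append((name, m.group()))
                if next_state is not None:
                    state = next_state
                pos = m.end()
                break
        else:
            # No rule matched: skip one character, like t_cppcomment_error does.
            errors += 1
            pos += 1
    return tokens, errors


source = 'cout<<"Hello World"; // this comment\ncout'
print(tokenize(source, ONLY_NEWLINE)[1])  # one error per comment character
print(tokenize(source, WITH_SKIP)[1])     # no errors
```

Both variants produce the same tokens; the only difference is whether the comment text falls through to the error handler or is consumed by a rule that belongs to the comment state.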

Upvotes: 0
