Reputation: 7486
I am learning lexers in Python and am using the PLY library for lexical analysis of some strings. I have implemented the following lexical analyzer for part of the C++ language syntax.
However, I am seeing some strange behavior. When I define the COMMENT state's function definitions after the other function definitions, the code works fine. If I define the COMMENT state functions before the other definitions, I get errors as soon as the // section of the input string starts.
What is the reason behind that?
import ply.lex as lex

tokens = (
    'DLANGLE',      # <<
    'DRANGLE',      # >>
    'EQUAL',        # =
    'STRING',       # "144"
    'WORD',         # 'Welcome' in "Welcome."
    'SEMICOLON',    # ;
)

t_ignore = ' \t\v\r'    # shortcut for whitespace

states = (
    ('cppcomment', 'exclusive'),    # exclusive state for C++ '//' comments
)

def t_cppcomment(t):    # defining this here (before the other rules) causes errors
    r'//'
    print 'MyCOm:', t.value
    t.lexer.begin('cppcomment')

def t_cppcomment_end(t):
    r'\n'
    t.lexer.begin('INITIAL')

def t_cppcomment_error(t):
    print "Error FOUND"
    t.lexer.skip(1)

def t_DLANGLE(t):
    r'<<'
    print 'MyLAN:', t.value
    return t

def t_DRANGLE(t):
    r'>>'
    return t

def t_SEMICOLON(t):
    r';'
    print 'MySemi:', t.value
    return t

def t_EQUAL(t):
    r'='
    return t

def t_STRING(t):
    r'"[^"]*"'
    t.value = t.value[1:-1]    # drop "surrounding quotes"
    print 'MyString:', t.value
    return t

def t_WORD(t):
    r'[^ <>\n]+'
    print 'MyWord:', t.value
    return t

webpage = "cout<<\"Hello World\"; // this comment"

htmllexer = lex.lex()
htmllexer.input(webpage)

while True:
    tok = htmllexer.token()
    if not tok: break
    print tok
Regards
Upvotes: 0
Views: 848
Reputation: 7486
Just figured it out. Since I have defined the comment state as exclusive, it won't use the inclusive (INITIAL) state rules (when the comment rules are defined at the top; otherwise it uses them for some reason). So you have to redefine all the rules again for the comment state. That is why PLY provides the error() rule, which skips over characters for which no specific rule is defined.
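To illustrate, here is a minimal, self-contained sketch of that idea (the names mirror the question's code, but the token set is trimmed down to a single WORD token, which is my own simplification): the exclusive state defines only its own t_ignore, a rule that returns to INITIAL at the newline, and an error() rule that silently skips everything else.

import ply.lex as lex

tokens = ('WORD',)

states = (
    ('cppcomment', 'exclusive'),    # INITIAL rules are not shared with this state
)

t_ignore = ' \t'

def t_cppcomment(t):    # runs in INITIAL; wins over t_WORD because function
    r'//'               # rules are tried in definition order
    t.lexer.begin('cppcomment')

def t_WORD(t):
    r'[^ \t\n]+'
    return t

def t_error(t):
    t.lexer.skip(1)

# The exclusive state only says how to leave it and how to skip
# characters it has no rule for.
t_cppcomment_ignore = ''

def t_cppcomment_end(t):
    r'\n'
    t.lexer.begin('INITIAL')    # comment ends at the newline

def t_cppcomment_error(t):
    t.lexer.skip(1)             # swallow comment characters one at a time

lexer = lex.lex()
lexer.input("hello world // skipped entirely\nnext")
while True:
    tok = lexer.token()
    if not tok: break
    print tok

Run on that sample input, this should produce WORD tokens for hello, world and next, and drop the comment without printing any error output.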
Upvotes: 1
Reputation: 113940
It's because you have no rules that accept 'this' or 'comment'. And since you really don't care about what's in the comment, you can easily do something like

    t_cppcomment_ANYTHING = '[^\r\n]'

just below your t_ignore rule.
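For example, a sketch of how that could look as a drop-in next to the question's other cppcomment rules (ANYTHING is just an arbitrary name, not something PLY defines). One caveat, at least in the PLY versions I have used: a plain string rule needs its token name listed in the tokens tuple, so a function rule that returns nothing is an easy way to consume the comment text without emitting a token:

# Drop-in addition to the lexer from the question: consume the rest of the
# comment line in one match instead of hitting the error rule per character.
def t_cppcomment_ANYTHING(t):
    r'[^\r\n]+'
    pass    # no return value, so the matched comment text is simply discarded

With that in place, the error handler for the comment state is no longer triggered for every character inside the comment.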
Upvotes: 0