Reputation: 5248
I have a multiline string of text, and I want to parse portions of Python code in a line-by-line manner, such that I end up with a list of strings, each item representing its own Python statement.
Unfortunately, I can't just build an AST of the entire text, because portions of it will not contain valid Python syntax. Each Python statement may also span multiple lines. For each valid statement, I do know which line it starts on; what I can't distinguish is invalid syntax from continuations of previous lines.
For example, I might have something like this (I'll add comments denoting which lines I know start valid syntax, and which lines are either invalid syntax or a continuation of the previous statement):
foo = bar() # valid-start
this = ( # valid-start
'perfectly valid syntax' # unknown
) # unknown
44x=but-this-is-bad-syntax # unknown
The desired output here is a list of tuples, with the first item denoting whether the statement is valid Python or junk, and the second item being the text corresponding to that statement.
[
('PY', 'foo = bar()'),
('PY', """this = (
'perfectly valid syntax'
)"""),
('JUNK', '44x=but-this-is-bad-syntax')
]
One solution I considered was checking to see if the parentheses were balanced, but this becomes tricky when strings are involved (I also haven't convinced myself this works in all cases yet).
foo = bar() # valid-start
this = ''' # valid-start
this is still () # unknown
perfectly valid ))) # unknown
''' # unknown
z = ''' # unknown
even though this is valid # unknown
syntax, I don't want this # unknown
line grouped''' # unknown
44x=but-this-is-bad-syntax # unknown
a = 1 # valid-start
should produce something like this output:
[
('PY', 'foo = bar()'),
('PY', """this = '''
this is still ()
perfectly valid )))
'''"""),
('JUNK', "z = '''"),
('JUNK', "even though this is valid"),
('JUNK', "syntax, I don't want this"),
('JUNK', "line grouped'''"),
('JUNK', "44x=but-this-is-bad-syntax"),
('PY', "a = 1"),
]
Note that in the last example, the line starting with z = ''' is marked as unknown. Even though continuing the statement would still produce valid syntax, I want to stop parsing the statement that starts with this = ''' exactly once it becomes valid syntax (i.e. z = ''' would not be included).
Does anyone have an idea on how this might be done?
Would a pyparsing solution that simply checks for balanced parentheses while taking strings into account be sufficient? The idea is that I would define a grammar that accepts balanced parentheses / square brackets / curly braces, where the body of the nesting could be any sequence of characters or a string (which may contain parens, but those won't be counted towards balance). Then, I would parse lines with this grammar until the reconstructed lines exactly equaled the original lines.
Does anyone see a problem with the previous approach / does anyone have a simpler method for doing this that doesn't involve a dependency on pyparsing?
Based on the answer from @rici, I came up with a function that takes a list of lines and returns True if the lines form a complete statement and False otherwise.
import tokenize
from six.moves import cStringIO as StringIO

def is_complete_statement(lines):
    """
    Checks if the lines form a complete Python statement.
    """
    try:
        stream = StringIO()
        stream.write('\n'.join(lines))
        stream.seek(0)
        for t in tokenize.generate_tokens(stream.readline):
            pass
    except tokenize.TokenError as ex:
        message = ex.args[0]
        if message.startswith('EOF in multi-line'):
            return False
        raise
    else:
        return True
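To turn this check into the desired list of ('PY', ...) / ('JUNK', ...) tuples, the grouping step can be sketched as below. This is a minimal Python 3 sketch rather than the exact code above: it swaps six's cStringIO for io.StringIO, additionally catches SyntaxError (newer tokenizers can raise it for some unterminated tokens), and assumes the known valid-start line indices are available as a set (the valid_starts parameter is a hypothetical name).

```python
import io
import tokenize

def is_complete_statement(lines):
    """Return True if the lines tokenize as one complete Python statement."""
    stream = io.StringIO('\n'.join(lines) + '\n')
    try:
        for _ in tokenize.generate_tokens(stream.readline):
            pass
    except (tokenize.TokenError, SyntaxError):
        # Unbalanced brackets or an unterminated string run into EOF mid-token.
        return False
    return True

def group_statements(lines, valid_starts):
    """Greedily extend each known statement start until it first tokenizes cleanly."""
    out = []
    i = 0
    while i < len(lines):
        if i in valid_starts:
            for j in range(i + 1, len(lines) + 1):
                if is_complete_statement(lines[i:j]):
                    # Stop at the first prefix that forms a complete statement,
                    # so a following statement is never swallowed.
                    out.append(('PY', '\n'.join(lines[i:j])))
                    i = j
                    break
            else:
                # Never completed: treat the start line alone as junk.
                out.append(('JUNK', lines[i]))
                i += 1
        else:
            out.append(('JUNK', lines[i]))
            i += 1
    return out
```

With the first example, group_statements(lines, {0, 1}) yields ('PY', ...) pairs for the two statements and a ('JUNK', ...) entry for the bad line; lines that are not known starts are emitted one junk entry per line, matching the z = ''' behavior described above.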
Upvotes: 5
Views: 2850
Reputation: 241911
The standard Python library includes modules which both tokenize and parse Python input. Even if your use case is not suitable for the built-in Python parser (the ast module), you might well find that the tokenize module is useful. (For example, it correctly tokenizes string literals.)
Here's a simple demonstration in Python 2.7:
$ cat tokenize_demo.py
from sys import stdin
from tokenize import generate_tokens
from token import tok_name

for t in generate_tokens(stdin.readline):
    print (tok_name[t[0]], t[1])
$ python tokenize_demo.py <<"EOF"
> foo = bar()
> this = '''
>  this is still ()
>  perfectly valid )))
> '''
> if not True:
>  print "false"
> 44x=this-is-bad-syntax but it can be tokenized
> a = 1
> EOF
('NAME', 'foo')
('OP', '=')
('NAME', 'bar')
('OP', '(')
('OP', ')')
('NEWLINE', '\n')
('NAME', 'this')
('OP', '=')
('STRING', "'''\n this is still ()\n perfectly valid )))\n'''")
('NEWLINE', '\n')
('NAME', 'if')
('NAME', 'not')
('NAME', 'True')
('OP', ':')
('NEWLINE', '\n')
('INDENT', ' ')
('NAME', 'print')
('STRING', '"false"')
('NEWLINE', '\n')
('DEDENT', '')
('NUMBER', '44')
('NAME', 'x')
('OP', '=')
('NAME', 'this')
('OP', '-')
('NAME', 'is')
('OP', '-')
('NAME', 'bad')
('OP', '-')
('NAME', 'syntax')
('NAME', 'but')
('NAME', 'it')
('NAME', 'can')
('NAME', 'be')
('NAME', 'tokenized')
('NEWLINE', '\n')
('NAME', 'a')
('OP', '=')
('NUMBER', '1')
('NEWLINE', '\n')
('ENDMARKER', '')
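For Python 3 readers, here is a sketch of the same demonstration driven from an in-memory string instead of a shell heredoc; generate_tokens has the same readline-based interface. One caveat: the reimplemented tokenize module in newer CPython releases is stricter and may raise on some invalid literals (such as 44x), so the junk line is omitted here.

```python
import io
import tokenize
from token import tok_name

source = """\
foo = bar()
this = '''
 this is still ()
 perfectly valid )))
'''
a = 1
"""

# generate_tokens takes any readline callable, so a StringIO works
# just as well as sys.stdin did in the Python 2 demonstration.
for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok_name[tok.type], repr(tok.string))
```

As in the Python 2 output, the triple-quoted string arrives as a single STRING token spanning all four of its source lines.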
Upvotes: 3