Erotemic

Reputation: 5248

How to parse python code line-by-line until an expression is complete

I have a multiline string of text, and I want to parse a portion of Python code in a line-by-line manner, such that I end up with a list of strings, each item representing its own Python statement.

Unfortunately, I can't just build an AST of the entire text, because portions of the text will not contain valid Python syntax. Each Python statement may also span multiple lines. For each valid statement I do know which line it starts on, but I can't distinguish between invalid syntax and continuations of previous lines.

For example, I might have something like this (I'll add a comment denoting which lines I know start valid syntax, and which lines are either invalid syntax or a continuation of the previous statement):

foo = bar()                 # valid-start 
this = (                    # valid-start
'perfectly valid syntax'    # unknown
)                           # unknown
44x=but-this-is-bad-syntax  # unknown

The desired output here is a list of tuples, with the first item denoting whether the statement is valid Python or junk, and the second item being the text corresponding to that statement.

[
    ('PY', 'foo = bar()'),
    ('PY', """this = (                    
    'perfectly valid syntax'    
    )"""),                           
    ('JUNK', '44x=but-this-is-bad-syntax')
]

One solution I considered was checking to see if the parenthesis were balanced, but this becomes tricky when strings are involved (I also haven't convinced myself this works in all cases yet).

foo = bar()                 # valid-start 
this = '''                  # valid-start
    this is still ()        # unknown
    perfectly valid )))     # unknown
'''                         # unknown
z = '''                     # unknown
  even though this is valid # unknown
  syntax, I don't want this # unknown
  line grouped'''           # unknown
44x=but-this-is-bad-syntax  # unknown
a = 1                       # valid-start

should produce something like this output:

[
    ('PY', 'foo = bar()'),
    ('PY', """this = '''                  
    this is still ()        
    perfectly valid )))     
    '''"""),                           
    ('JUNK', "z = '''"),
    ('JUNK', "even though this is valid"),
    ('JUNK', "syntax, I don't want this"),
    ('JUNK', "line grouped'''"),
    ('JUNK', "44x=but-this-is-bad-syntax"),
    ('PY', "a = 1"),
]

Note that in the last example, the line starting with z = ''' is marked as unknown. Even though continuing the parse would still produce valid syntax, I want the statement starting with this = ''' to end exactly where it first becomes valid syntax (i.e. z = ''' would not be included in it).

Does anyone have an idea on how this might be done?

Would a pyparsing solution that simply checks for balanced parentheses while taking strings into account be sufficient? The idea is that I would define a grammar that accepts balanced parentheses / square brackets / curly braces, where the body of the nesting could be any sequence of characters or a string (which may contain brackets, but those won't count towards balance). Then, I would parse lines with this grammar until the reconstructed lines exactly equaled the original lines.

Does anyone see a problem with the previous approach, or does anyone have a simpler method for doing this that doesn't involve a dependency on pyparsing?
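A quick illustration of why naive bracket counting fails once strings are involved (this is just a sketch of the problem, not a solution): a statement can be perfectly valid Python while its brackets are wildly unbalanced by a raw character count.

```python
import ast

line = "paren_art = ':-) )))'"
# naive character counting says this line is unbalanced...
assert line.count('(') == 0 and line.count(')') == 4
# ...but it is a perfectly valid statement
ast.parse(line)
```

So any bracket-balancing approach has to tokenize strings first, which is exactly what the tokenize module below does.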

EDIT

Based on the answer from @rici I came up with a function that can take a list of lines, and returns True if the lines form a complete statement and False otherwise.

import tokenize
from six.moves import cStringIO as StringIO

def is_complete_statement(lines):
    """
    Checks if the lines form a complete Python statement.
    """
    try:
        stream = StringIO()
        stream.write('\n'.join(lines))
        stream.seek(0)
        for t in tokenize.generate_tokens(stream.readline):
            pass
    except tokenize.TokenError as ex:
        message = ex.args[0]
        if message.startswith('EOF in multi-line'):
            return False
        raise
    else:
        return True
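Building on that, here is a sketch of the full line-by-line grouping (the name split_statements and the starts argument are mine, not from the original question): given the set of line indices known to be valid starts, grow each statement until it first tokenizes completely, and emit every other line individually as junk. This version assumes Python 3, where newer interpreters (3.12+) may raise SyntaxError instead of TokenError from tokenize for unterminated strings, so both are caught.

```python
import io
import tokenize

def is_complete_statement(lines):
    """Return True if the lines tokenize as a complete statement."""
    stream = io.StringIO('\n'.join(lines) + '\n')
    try:
        for _ in tokenize.generate_tokens(stream.readline):
            pass
    except tokenize.TokenError:
        return False  # EOF hit inside a multi-line statement or string
    except SyntaxError:
        return False  # e.g. unterminated string literal on Python 3.12+
    return True

def split_statements(lines, starts):
    """Group `lines` into ('PY', text) and ('JUNK', line) tuples.

    `starts` is the set of indices known to begin valid statements
    (the "valid-start" lines from the question).  Each statement is
    extended greedily and ends at the first point where it tokenizes
    completely; an unterminated trailing statement is emitted as-is.
    """
    out = []
    i = 0
    while i < len(lines):
        if i in starts:
            j = i + 1
            while j < len(lines) and not is_complete_statement(lines[i:j]):
                j += 1
            out.append(('PY', '\n'.join(lines[i:j])))
            i = j
        else:
            out.append(('JUNK', lines[i]))
            i += 1
    return out
```

On the first example this yields ('PY', 'foo = bar()'), the grouped this = ( ... ) statement, and a JUNK tuple for the bad line; the z = ''' lines come out as individual JUNK entries because their start index is not in starts, exactly as desired.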

Upvotes: 5

Views: 2850

Answers (1)

rici

Reputation: 241911

The standard Python library includes modules which both tokenize and parse Python input. Even if your use case is not suitable for the built-in Python parser (the ast module), you might well find that the tokenize module is useful. (For example, it correctly tokenizes string literals.)

Here's a simple demonstration in Python 2.7:

$ cat tokdemo.py
# note: naming this file tokenize.py would shadow the stdlib module
from sys import stdin
from tokenize import generate_tokens
from token import tok_name
for t in generate_tokens(stdin.readline):
    print (tok_name[t[0]], t[1])
$ python tokdemo.py <<"EOF"
> foo = bar()
> this = '''
>    this is still ()
>    perfectly valid )))
> '''
> if not True:
>    print "false"
> 44x=this-is-bad-syntax but it can be tokenized
> a = 1
> EOF
('NAME', 'foo')
('OP', '=')
('NAME', 'bar')
('OP', '(')
('OP', ')')
('NEWLINE', '\n')
('NAME', 'this')
('OP', '=')
('STRING', "'''\n   this is still ()\n   perfectly valid )))\n'''")
('NEWLINE', '\n')
('NAME', 'if')
('NAME', 'not')
('NAME', 'True')
('OP', ':')
('NEWLINE', '\n')
('INDENT', '   ')
('NAME', 'print')
('STRING', '"false"')
('NEWLINE', '\n')
('DEDENT', '')
('NUMBER', '44')
('NAME', 'x')
('OP', '=')
('NAME', 'this')
('OP', '-')
('NAME', 'is')
('OP', '-')
('NAME', 'bad')
('OP', '-')
('NAME', 'syntax')
('NAME', 'but')
('NAME', 'it')
('NAME', 'can')
('NAME', 'be')
('NAME', 'tokenized')
('NEWLINE', '\n')
('NAME', 'a')
('OP', '=')
('NUMBER', '1')
('NEWLINE', '\n')
('ENDMARKER', '')
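In Python 3 the same demonstration can be run without a separate script (this variant is an editor's sketch, not part of the original answer): feed generate_tokens the readline method of a StringIO. The Python-2-only print statement is dropped from the sample source.

```python
import io
import tokenize
from token import tok_name

# the same snippet tokenized above (minus the Python 2 print block)
source = (
    "foo = bar()\n"
    "this = '''\n"
    "   this is still ()\n"
    "   perfectly valid )))\n"
    "'''\n"
    "44x=this-is-bad-syntax but it can be tokenized\n"
    "a = 1\n"
)

for tok in tokenize.generate_tokens(io.StringIO(source).readline):
    print(tok_name[tok.type], repr(tok.string))
```

The triple-quoted literal again comes out as a single STRING token, with the intervening brackets safely inside it.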

Upvotes: 3
