Lucas

Reputation: 1948

Does this code correctly identify Python strings?

I have some code which I think should return all parts of a Python statement that are not inside strings. However, I'm not sure it is as rigorous as I would like. Basically, it just finds the next string delimiter and stays in the "string" state until it is closed by the same delimiter. Is there anything wrong with what I have done for some weird case that I have not thought of? Will it be in any way inconsistent with what Python does?

# String delimiters in order of precedence
string_delims = ["'''",'"""',"'",'"']

# Get non string parts of a statement
def get_non_string(text):

    out = ""
    state = None

    while True:

        # not in string
        if state == None:
            vals = [text.find(s) for s in string_delims]

            # None will only be reached if all are -1 (i.e. no substring)
            for val,delim in zip(vals+[None], string_delims+[None]):
                if val == None:
                    out += text
                    return out

                if val >= 0:
                    i = val
                    state = delim
                    break

            out += text[:i]
            text = text[i+len(delim):]

        else:
            i = text.find(state)
            if i < 0:
                raise SyntaxError("Symbolic Subsystem: EOL while scanning string literal")
            text = text[i+len(delim)]
            state = None

Example Input:

get_non_string("hello'''everyone'''!' :)'''")

Example Output:

hello!

Upvotes: 4

Views: 103

Answers (2)

Anthon

Reputation: 76599

Your own code has problems with several cases, since you don't seem to make any provision for escaped quotes ("\"", """\"""", etc.).
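A quick way to see the escaped-quote problem is to let Python's own tokenizer scan a literal containing one: the escaped quote stays inside a single STRING token, whereas a naive find() for the next delimiter would end the string there. A minimal sketch using the standard tokenize module:

```python
import io
import tokenize

# Source containing a string literal with an escaped quote: x = "a\"b" + y
src = 'x = "a\\"b" + y\n'

# Collect the source text of every STRING token
strings = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(src).readline)
    if tok.type == tokenize.STRING
]
print(strings)  # the escaped quote remains inside one token: ['"a\\"b"']
```

A delimiter scan that treats every `"` as a string boundary would instead split this literal at the backslash-escaped quote.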

Also:

get_non_string('""')

throws an error.

I would not describe those as weird cases.

Upvotes: 1

unutbu

Reputation: 879481

Python can tokenize Python code:

import tokenize
import token
import io
import collections

class Token(collections.namedtuple('Token', 'num val start end line')):
    @property
    def name(self):
        return token.tok_name[self.num]

def get_non_string(text):
    result = []
    for tok in tokenize.generate_tokens(io.StringIO(text).readline):
        tok = Token(*tok)
        # print(tok.name, tok.val)
        if tok.name != 'STRING':
            result.append(tok.val)
    return ''.join(result)

print(get_non_string("hello'''everyone'''!' :)'''"))

yields

hello!

The heavy lifting is done by tokenize.generate_tokens.
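On Python 3, generate_tokens already yields TokenInfo named tuples with .type and .string attributes, so the wrapper class is not strictly needed; a minimal sketch of the same idea:

```python
import io
import token
import tokenize

def get_non_string(text):
    # Keep the source text of every token that is not a string literal
    return ''.join(
        tok.string
        for tok in tokenize.generate_tokens(io.StringIO(text).readline)
        if tok.type != token.STRING
    )

print(get_non_string("x = 'a' + 'b'\n"))
```

Note that only the token strings are joined, so whitespace between tokens is dropped (which is why the example above yields `hello!` rather than preserving spacing).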

Upvotes: 3
