Reputation: 2073
I need to determine whether a string is a valid Python identifier. Since Python 3 identifiers support obscure Unicode functionality, and Python syntax might change across releases, I decided to avoid manual parsing. Unfortunately, my attempts at using Python's internal interfaces don't seem to work:
I. function compile
>>> string = "a = 5; b "
>>> test = "{} = 5"
>>> compile(test.format(string), "<string>", "exec")
<code object <module> at 0xb71b4d90, file "<string>", line 1>
Clearly test can't force compile to use ast.Name as the root of the AST.
Next I tried the modules ast and parser. These modules are intended to derive a string, rather than to determine whether a string matches a particular derivation, but I figured they might be helpful anyway.
II. module ast
>>> import ast
>>> a = ast.Module(body=[ast.Expr(value=ast.Name(id='1a', ctx=ast.Load()))])
>>> af = ast.fix_missing_locations(a)
>>> c = compile(af, "<string>", "exec")
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name '1a' is not defined
OK, clearly Name isn't parsing '1a' for correctness. Perhaps this step happens earlier, in the parse phase.
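For contrast, here is a minimal sketch that lets the parser build the tree from the string instead of constructing the Name node by hand, so the string passes through the lexer first. The helper name parses_as_name is hypothetical, and the NFKC normalization step is an assumption based on the language reference stating that identifiers are normalized to NFKC while parsing.

```python
import ast
import unicodedata


def parses_as_name(s):
    """Return True if s, parsed as an expression, is exactly one Name node.

    Hypothetical helper, not part of the original post.
    """
    try:
        tree = ast.parse(s, mode="eval")
    except (SyntaxError, ValueError):
        # "1a", "a = 5; b " and keywords such as "def" are rejected here.
        return False
    node = tree.body
    # Comparing against the whole (normalized) input rejects "(a)" or " a",
    # which also parse to a single Name node.
    return isinstance(node, ast.Name) and node.id == unicodedata.normalize("NFKC", s)
```

Unlike the hand-built tree above, this rejects '1a', because the error is raised before any Name node exists.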
III. module parser
>>> import parser
>>> p = parser.suite("a")
>>> t = parser.st2tuple(p)
>>> t
(257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, 'a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>>
>>> t = (257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, '1a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> p = parser.sequence2st(t)
>>> c = parser.compilest(p)
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<syntax-tree>", line 0, in <module>
NameError: name '1a' is not defined
OK, still not being checked... why? A quick check of Python's full grammar specification shows that NAME is not defined there. If these checks are performed by the bytecode compiler, shouldn't 1a have been caught?
I'm starting to suspect Python exposes no functionality for this purpose. I'm also curious why some of these attempts failed.
Upvotes: 0
Views: 778
Reputation: 241971
You don't need to parse, just tokenize, and -- if you care -- test whether the returned NAME is a keyword.
Example, partly adapted from the linked documentation:
>>> import tokenize
>>> from io import BytesIO
>>> from keyword import iskeyword
>>> s = "def twoπ(a,b):"
>>> g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
>>> for toktype, tokval, st, end, _ in g:
...     if toktype == tokenize.NAME and iskeyword(tokval):
...         print("KEYWORD ", tokval)
...     else:
...         print(toktype, tokval)
...
56 utf-8
KEYWORD def
1 twoπ
52 (
1 a
52 ,
1 b
52 )
52 :
0
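You'll always get an ENCODING token (56 here) at the beginning of the input, and an ENDMARKER (0) at the end. The numeric values printed for token types such as ENCODING can differ between Python versions, so compare against the named constants in the tokenize module rather than the numbers.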
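To turn the loop above into a yes/no check, one possible sketch follows. The function name is_identifier and its allow_keywords parameter are hypothetical. It uses tokenize.generate_tokens on a str, so no ENCODING token is produced, and it treats any tokenizer error as "not an identifier".

```python
import io
import keyword
import tokenize


def is_identifier(s, allow_keywords=False):
    """Return True if s tokenizes to exactly one NAME token covering all of s.

    Hypothetical helper; a sketch, not a vetted implementation.
    """
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(s).readline))
    except (tokenize.TokenError, SyntaxError):
        # Depending on the Python version, invalid input may raise here.
        return False
    # Ignore the bookkeeping tokens that the tokenizer always appends.
    bookkeeping = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}
    significant = [t for t in tokens if t.type not in bookkeeping]
    if len(significant) != 1:
        return False
    tok = significant[0]
    if tok.type != tokenize.NAME or tok.string != s:
        return False
    return allow_keywords or not keyword.iskeyword(s)
```

For plain cases, the built-in str.isidentifier() combined with keyword.iskeyword() covers the same ground without tokenizing.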
Upvotes: 1
Reputation: 13809
I'm not sure where you were going with your compile example, but if you compile just the potential identifier for eval, it exposes what is going on.
>>> from dis import dis
>>> dis(compile("1", "<string>", "eval"))
1 0 LOAD_CONST 0 (1)
3 RETURN_VALUE
>>> dis(compile("a", "<string>", "eval"))
1 0 LOAD_NAME 0 (a)
3 RETURN_VALUE
>>> dis(compile("1a", "<string>", "eval"))
File "<string>", line 1
1a
^
SyntaxError: unexpected EOF while parsing
>>> dis(compile("你好", "<string>", "eval"))
1 0 LOAD_NAME 0 (你好)
3 RETURN_VALUE
It would require more testing for edge cases before real use, but getting a LOAD_NAME opcode back is indicative. Failure can show up either as an exception or as a different opcode, so you have to check for both.
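A rough sketch of checking both failure modes programmatically, under some assumptions: the helper name loads_single_name is hypothetical, dis.get_instructions is used instead of reading the printed disassembly, and the NFKC normalization is there because the compiler normalizes identifiers before storing them. Comparing the loaded name against the whole input rules out strings like "(a)" that also produce a single LOAD_NAME.

```python
import dis
import unicodedata


def loads_single_name(s):
    """Return True if compiling s for eval yields one LOAD_NAME of s itself.

    Hypothetical helper illustrating the two failure modes described above.
    """
    try:
        code = compile(s, "<string>", "eval")
    except (SyntaxError, ValueError):
        # Failure mode 1: the string does not compile at all.
        return False
    # Failure mode 2: it compiles, but not to a single load of that name.
    names = [
        instr.argval
        for instr in dis.get_instructions(code)
        if instr.opname == "LOAD_NAME"
    ]
    return names == [unicodedata.normalize("NFKC", s)]
```

Opcode names are an implementation detail of CPython and may change between releases, so this is best treated as an illustration rather than a dependable check.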
Upvotes: 1