user19087

Reputation: 2073

Parse Python Identifier

I need to determine whether a string is a valid Python identifier. Because Python 3 identifiers allow a wide range of Unicode characters, and Python's syntax may change between releases, I decided against parsing by hand. Unfortunately, my attempts to use Python's own interfaces don't seem to work:

I. function compile

>>> string = "a = 5; b "
>>> test = "{} = 5"
>>> compile(test.format(string), "<string>", "exec")
<code object <module> at 0xb71b4d90, file "<string>", line 1>

Clearly the test template can't force compile to treat the string as a single ast.Name node: the injected string simply turns the source into two assignment statements.
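To make the failure visible, here is a minimal sketch that lists the statements produced from the formatted string. It uses ast.parse, which is not part of the attempt above, and the variable statement_types is only an illustrative name:

```python
import ast

string = "a = 5; b "
test = "{} = 5"

# The formatted source is "a = 5; b  = 5": two complete assignments,
# so the parser never has to treat the input as a single name.
tree = ast.parse(test.format(string), mode="exec")
statement_types = [type(node).__name__ for node in tree.body]
print(statement_types)  # ['Assign', 'Assign']
```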

Next I tried the ast and parser modules. These modules are meant to produce a derivation of a string rather than to decide whether a string matches a particular derivation, but I figured they might help anyway.

II. module ast

>>> import ast
>>> a = ast.Module(body=[ast.Expr(value=ast.Name(id='1a', ctx=ast.Load()))])
>>> af = ast.fix_missing_locations(a)
>>> c = compile(af, "<string>", "exec")
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 1, in <module>
NameError: name '1a' is not defined

OK, clearly ast.Name doesn't validate '1a'. Perhaps that check happens earlier, during parsing.

III. module parser

>>> import parser
>>> p = parser.suite("a")
>>> t = parser.st2tuple(p)
>>> t
(257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, 'a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> 
>>> t = (257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, '1a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> p = parser.sequence2st(t)
>>> c = parser.compilest(p)
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<syntax-tree>", line 0, in <module>
NameError: name '1a' is not defined

OK, still not checked... why? A quick look at Python's full grammar specification shows that NAME is not defined there. If these checks were performed by the bytecode compiler, shouldn't '1a' have been caught?

I'm starting to suspect that Python exposes no functionality for this. I'm also curious why these attempts failed.

Upvotes: 0

Views: 778

Answers (2)

rici

Reputation: 241971

You don't need to parse; just tokenize and, if you care, test whether the returned NAME is a keyword.

Example, partly adapted from the linked documentation:

>>> import tokenize
>>> from io import BytesIO
>>> from keyword import iskeyword
>>> s = "def twoπ(a,b):"
>>> g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
>>> for toktype, tokval, st, end, _ in g:
...   if toktype == tokenize.NAME and iskeyword(tokval):
...     print("KEYWORD ", tokval)
...   else:
...     print(toktype, tokval)
... 
56 utf-8
KEYWORD  def
1 twoπ
52 (
1 a
52 ,
1 b
52 )
52 :
0 

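You'll always get an ENCODING token (56 here) at the beginning of the input and an ENDMARKER (0) at the end. The numeric values of some token types differ between Python versions, so compare against the tokenize constants rather than the numbers.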
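Turning the loop above into a yes/no check might look like the sketch below. The helper name is_identifier and its allow_keywords parameter are made up for illustration, and the rule it applies (exactly one NAME token whose text equals the whole input) is an assumption about what "valid identifier" should mean here. Tokenizer behaviour on malformed input varies between releases, so both an exception and an unexpected token stream are treated as "not an identifier":

```python
import tokenize
from io import BytesIO
from keyword import iskeyword

# Token types that only frame the input and carry no content.
_FRAMING = {tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def is_identifier(s, allow_keywords=False):
    """Return True if s tokenizes to exactly one NAME token spanning all of s."""
    try:
        # The generator is lazy, so it is consumed inside the try block.
        tokens = [
            tok
            for tok in tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
            if tok.type not in _FRAMING
        ]
    except (tokenize.TokenError, SyntaxError, UnicodeEncodeError):
        return False
    if len(tokens) != 1 or tokens[0].type != tokenize.NAME:
        return False
    # Reject surrounding whitespace such as "a " or " a".
    if tokens[0].string != s:
        return False
    return allow_keywords or not iskeyword(s)
```

For example, is_identifier("twoπ") should be true, while "1a", "def" and "a = 5; b " should all be rejected. It is probably worth cross-checking any such helper against the built-in str.isidentifier() combined with keyword.iskeyword().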

Upvotes: 1

Jason S

Reputation: 13809

I'm not sure where you were going with your compile example, but if you compile just the candidate identifier in eval mode, it shows what is going on.

>>> from dis import dis
>>> dis(compile("1", "<string>", "eval"))

  1           0 LOAD_CONST               0 (1)
              3 RETURN_VALUE

>>> dis(compile("a", "<string>", "eval"))

  1           0 LOAD_NAME                0 (a)
              3 RETURN_VALUE

>>> dis(compile("1a", "<string>", "eval"))

  File "<string>", line 1
    1a
     ^
SyntaxError: unexpected EOF while parsing

>>> dis(compile("你好", "<string>", "eval"))

  1           0 LOAD_NAME                0 (你好)
              3 RETURN_VALUE

This would need more testing against edge cases before real use, but getting a LOAD_NAME opcode back is a good indication that the string is an identifier. Failure can show up either as an exception or as a different opcode, so you have to check for both.
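Because the exact bytecode differs between Python versions, a variant of the same idea is to inspect the AST instead of the opcodes. The sketch below is that swapped-in technique, not the dis-based check shown above, and the function name parses_as_name is hypothetical. It covers both failure modes: an exception from the parser, and a root node that is not a bare name.

```python
import ast
import unicodedata


def parses_as_name(s):
    """Return True if s, parsed in eval mode, is a single bare name."""
    try:
        tree = ast.parse(s, mode="eval")
    except (SyntaxError, ValueError):
        # First failure mode: the parser rejects the string outright.
        return False
    node = tree.body
    # Second failure mode: it parses, but not to a plain name.
    # The parser NFKC-normalizes identifiers, so compare against the
    # normalized input; this also rejects "(a)" and " a", which yield a Name.
    return isinstance(node, ast.Name) and node.id == unicodedata.normalize("NFKC", s)
```

With this, parses_as_name("你好") should be true, while "1", "1a", "(a)" and keywords such as "def" should be rejected.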

Upvotes: 1
