Parse Python Identifier

Question

I need to determine whether a string is a valid Python identifier. Since Python 3 identifiers may contain a wide range of Unicode characters, and Python's syntax might change across releases, I decided to avoid manual parsing. Unfortunately, my attempts at using Python's internal interfaces don't seem to work:

I. function compile

>>> string = "a = 5; b "
>>> test = "{} = 5"
>>> compile(test.format(string), "", "exec")
<code object <module> at 0xb71b4d90, file "", line 1>

Clearly the test template can't force compile to use ast.Name as the root of the AST: the formatted source, "a = 5; b  = 5", is simply two valid statements.
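One way to make the parser itself enforce "exactly one name" might be to parse in eval mode and inspect the root of the resulting tree. The following is a minimal sketch, not a definitive implementation; the helper name is_single_name is hypothetical and not part of any library:

```python
import ast
import unicodedata


def is_single_name(s):
    """Return True if s parses as exactly one bare name (hypothetical helper)."""
    try:
        # mode="eval" accepts a single expression only, so "a = 5; b " is rejected.
        tree = ast.parse(s, mode="eval")
    except (SyntaxError, ValueError):
        return False
    node = tree.body
    # The parser NFKC-normalizes identifiers, so compare against the normalized
    # input; this also rejects "(a)" or "a ", which parse to the same Name node.
    return isinstance(node, ast.Name) and node.id == unicodedata.normalize("NFKC", s)


print(is_single_name("twoπ"))
print(is_single_name("1a"))
print(is_single_name("a = 5; b "))
```

Keywords such as def are rejected as well, because they are a syntax error when used as an expression.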

Next I tried the ast and parser modules. These modules are intended to build a tree from a string rather than to check whether a string matches a particular production, but I figured they might be helpful anyway.

II. module ast

>>> a=ast.Module(body=[ast.Expr(value=ast.Name(id='1a', ctx=ast.Load()))])
>>> af = ast.fix_missing_locations(a)
>>> c = compile(af, "", "exec")
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "", line 1, in <module>
NameError: name '1a' is not defined


OK, clearly Name doesn't check '1a' for validity. Perhaps that check happens earlier, in the parse phase.
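The difference seems to be where the check runs: source text goes through the tokenizer, while a hand-built node never does. A small sketch contrasting the two paths (the variable names are illustrative only):

```python
import ast

# Path 1: source text goes through the tokenizer, which rejects "1a".
try:
    ast.parse("1a", mode="eval")
    parsed_ok = True
except SyntaxError:
    parsed_ok = False

# Path 2: a hand-built tree skips tokenizing, so the bogus id is accepted
# by compile() and only fails at run time as an ordinary name lookup.
tree = ast.fix_missing_locations(
    ast.Expression(body=ast.Name(id="1a", ctx=ast.Load()))
)
code = compile(tree, "<ast>", "eval")
try:
    eval(code)
    lookup_error = None
except NameError as exc:
    lookup_error = exc

print(parsed_ok)
print(type(lookup_error).__name__)
```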

III. module parser

>>> p = parser.suite("a")
>>> t = parser.st2tuple(p)
>>> t
(257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, 'a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> 
>>> t = (257, (268, (269, (270, (271, (272, (302, (306, (307, (308, (309, (312, (313, (314, (315, (316, (317, (318, (319, (320, (1, '1a')))))))))))))))))), (4, ''))), (4, ''), (0, ''))
>>> p = parser.sequence2st(t)
>>> c = parser.compilest(p)
>>> exec(c)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<syntax-tree>", line 0, in <module>
NameError: name '1a' is not defined


OK, it's still not being checked... why? A quick look at Python's full grammar specification shows that NAME is not defined there. If these checks were performed by the bytecode compiler, shouldn't 1a have been caught?

I'm starting to suspect Python exposes no functionality for this purpose. I'm also curious why these attempts failed.

rici · Accepted Answer

You don't need to parse, just tokenize, and -- if you care -- test whether the returned NAME is a keyword.

Example, partly adapted from the linked documentation:

>>> import tokenize
>>> from io import BytesIO
>>> from keyword import iskeyword
>>> s = "def twoπ(a,b):"
>>> g = tokenize.tokenize(BytesIO(s.encode("utf-8")).readline)
>>> for toktype, tokval, st, end, _ in g:
...   if toktype == tokenize.NAME and iskeyword(tokval):
...     print("KEYWORD ", tokval)
...   else:
...     print(toktype, tokval)
... 
56 utf-8
KEYWORD  def
1 twoπ
52 (
1 a
52 ,
1 b
52 )
52 :
0

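You'll always get an ENCODING token (56 here) at the beginning of the input and an ENDMARKER (0) at the end. The numeric values of most token types vary between Python versions; tokenize.tok_name maps them to their names.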
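To turn this into a yes/no check for a whole string, one option is to require that the input tokenizes to exactly one NAME token. This is only a sketch under that assumption; is_identifier and its allow_keywords parameter are hypothetical names, and it uses tokenize.generate_tokens on a str so that no ENCODING token is produced:

```python
import io
import tokenize
from keyword import iskeyword


def is_identifier(s, allow_keywords=False):
    """Return True if s tokenizes to exactly one NAME token (hypothetical helper)."""
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(s).readline))
    except (tokenize.TokenError, SyntaxError, ValueError):
        # Depending on the Python version, malformed input may raise instead
        # of producing error tokens.
        return False
    # Ignore the bookkeeping tokens that the tokenizer adds at the end of input.
    skip = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)
    significant = [tok for tok in tokens if tok.type not in skip]
    if len(significant) != 1:
        return False
    tok = significant[0]
    # Requiring tok.string == s rejects surrounding whitespace such as "a ".
    if tok.type != tokenize.NAME or tok.string != s:
        return False
    return allow_keywords or not iskeyword(s)


print(is_identifier("twoπ"))
print(is_identifier("1a"))
print(is_identifier("def"))
```

For inputs like these, the result should agree with the built-in s.isidentifier() combined with not iskeyword(s), which may be simpler if you don't need the token stream.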
