Reputation: 1
I see some strange behaviour when using the tokenize module:
import tokenize, io, sys
from contextlib import closing

# NOTE: the lines inside `source` are indented with tab characters
source = u"""
for i in range(10):
	print "HELLO"
	if i==5:
		print "bingo"
"""

def parse():
    with closing(io.StringIO(source)) as f:
        for type, token, (srow, scol), (erow, ecol), line in tokenize.generate_tokens(f.readline):
            print("%d,%d-%d,%d:\t%s\t%s" % (srow, scol, erow, ecol, tokenize.tok_name[type], repr(token)))
            yield type, token, (srow, scol), (erow, ecol), line

for token in tokenize.untokenize(parse()):
    sys.stdout.write(token)
The indentation before the if is missing in the output printed to the console. This happens with both Python 2 and 3. Is this a known bug, or am I using the module in the wrong way? Here is the relevant part of the tokenizer output:
3,1-3,6: NAME u'print'
3,7-3,14: STRING u'"HELLO"'
3,14-3,15: NEWLINE u'\n'
4,1-4,3: NAME u'if'
4,4-4,5: NAME u'i'
4,5-4,7: OP u'=='
I use tabs for indentation. When I replace the tabs with 4 spaces, I obtain the correct result:
3,4-3,9: NAME u'print'
3,10-3,17: STRING u'"HELLO"'
3,17-3,18: NEWLINE u'\n'
4,4-4,6: NAME u'if'
4,7-4,8: NAME u'i'
4,8-4,10: OP u'=='
4,10-4,11: NUMBER u'5'
4,11-4,12: OP u':'
The difference is the start column of if, which is 1 when using tabs and 4 when using spaces. It seems to be a bug in the untokenize function, which appears to output spaces instead of tabs. Can someone confirm?
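To see exactly what untokenize emits, its result can be repr()'d (this reuses the parse() generator defined above; result is just a name I chose for the check):

result = tokenize.untokenize(parse())
print(repr(result))  # makes it visible whether the indentation came back as '\t' or as spaces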
Upvotes: 0
Views: 76
Reputation: 11310
Everything is exactly as it should be. Column indexes are 0-based. One tab is one character, so if is correctly reported at column 1. When you change that tab to 4 spaces, there are indeed 4 characters before it, so if is reported at column 4.
Proof from the output of your code:
1,0-1,1: NL u'\n'
2,0-2,3: NAME u'for'
As you can see, both the linefeed in the first line and the for in the second line start in column 0. You might be confused because line numbers are 1-based while column numbers are 0-based.
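A quick way to see this (an illustrative snippet, not taken from the question) is to tokenize the same statement indented once with one tab and once with four spaces:

import io, tokenize

# one tab vs. four spaces in front of the same statement
for src in (u"\tpass\n", u"    pass\n"):
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok[0] == tokenize.NAME:
            # a tab is a single character, so it advances the column by only 1
            print("%s -> NAME starts at column %d" % (repr(src), tok[2][1]))

This prints column 1 for the tab-indented line and column 4 for the space-indented one.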
Footnote: You should use spaces for indentation. This convention ensures that indentation displays consistently on other developers' machines: the displayed width of a tab can be adjusted per editor, whereas a space is always exactly one character wide.
EDIT: After OP's comments arguing that there is a problem with tokenize and untokenize, I must add the following note. As the official documentation says:
The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change.
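In other words, the token types and token strings survive a round trip, but the whitespace between tokens does not have to. A minimal sketch of that guarantee with a current Python 3 (the source string is illustrative, not OP's exact input):

import io, tokenize

source = u"for i in range(10):\n\tx = i\n"
original = list(tokenize.generate_tokens(io.StringIO(source).readline))
rebuilt_source = tokenize.untokenize(original)
rebuilt = list(tokenize.generate_tokens(io.StringIO(rebuilt_source).readline))

# only (token type, token string) pairs are promised to survive the round trip;
# column positions and inter-token spacing may differ
assert [t[:2] for t in original] == [t[:2] for t in rebuilt]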
Upvotes: 1