user3172811

Reputation: 1

Strange behaviour of the tokenize module

I see some strange behaviour when using the tokenize module:

import tokenize, io, sys
from contextlib import closing
source=u"""
for i in range(10):
    print "HELLO"
    if i==5:
        print "bingo"
"""
def parse():
    with closing(io.StringIO(source)) as f:
        for type, token, (srow, scol), (erow, ecol), line in tokenize.generate_tokens(f.readline):
            print("%d,%d-%d,%d:\t%s\t%s" % (srow, scol, erow, ecol, tokenize.tok_name[type], repr(token)))
            yield type, token, (srow, scol), (erow, ecol), line

for token in tokenize.untokenize(parse()):
    sys.stdout.write(token)

The INDENT before the if is missing in the console output. This happens with both Python 2 and Python 3. Is this a known bug, or am I using the module incorrectly?

3,1-3,6:    NAME    u'print'
3,7-3,14:   STRING  u'"HELLO"'
3,14-3,15:  NEWLINE u'\n'
4,1-4,3:    NAME    u'if'
4,4-4,5:    NAME    u'i'
4,5-4,7:    OP  u'=='

I use tabs for indentation. When I replace the tabs with 4 spaces, I get the correct result:

3,4-3,9:    NAME    u'print'
3,10-3,17:  STRING  u'"HELLO"'
3,17-3,18:  NEWLINE u'\n'
4,4-4,6:    NAME    u'if'
4,7-4,8:    NAME    u'i'
4,8-4,10:   OP  u'=='
4,10-4,11:  NUMBER  u'5'
4,11-4,12:  OP  u':'

The difference is the start column of if, which is 1 with tabs and 4 with spaces. This looks like a bug in the untokenize function, which seems to output spaces instead of tabs.

Can someone confirm?

Upvotes: 0

Views: 76

Answers (1)

ElmoVanKielmo

Reputation: 11310

Everything is exactly as it should be. Column indices are 0-based. A tab is a single character, so if is correctly reported at column 1. When you replace the tab with 4 spaces, there are 4 characters before it, so if is reported at column 4.
Proof from the output of your code:

1,0-1,1:    NL  u'\n'
2,0-2,3:    NAME    u'for'

As you can see, both the linefeed on the first line and for on the second line are at column 0.
The confusion may come from the fact that line numbers are 1-based.
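
To check this directly, here is a minimal sketch (assuming Python 3; the helper name start_col and the two sample sources are made up for illustration) that tokenizes the same snippet indented once with a tab and once with four spaces, and reports where a given name starts:

```python
import io
import tokenize


def start_col(source, name):
    """Return the 0-based start column of the first NAME token equal to `name`."""
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == name:
            return tok.start[1]
    return None


# Illustrative inputs: identical code, different indentation characters.
tab_source = "for i in range(10):\n\tif i == 5:\n\t\tpass\n"
space_source = "for i in range(10):\n    if i == 5:\n        pass\n"

print(start_col(tab_source, "if"))     # 1: one tab character precedes it
print(start_col(space_source, "if"))   # 4: four space characters precede it
print(start_col(tab_source, "for"))    # 0: columns are 0-based
```

The column is a count of characters before the token, not a display width, which is why a tab contributes exactly 1.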

Footnote: You should use spaces for indentation. This convention ensures that indents are displayed consistently on other developers' machines: the displayed width of a tab can be configured, whereas a space is always one character wide.

EDIT:
After the OP's comments arguing that there is a problem with tokenize and untokenize, I must add the following note.
As the official documentation says:
The result is guaranteed to tokenize back to match the input so that the conversion is lossless and round-trips are assured. The guarantee applies only to the token type and token string as the spacing between tokens (column positions) may change.
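
A rough way to see what that guarantee covers, sketched under the assumption of Python 3: the helper name type_string_pairs and the sample source are hypothetical. Passing only (type, string) pairs to untokenize makes it rebuild the text without any position information, so the spacing of the result may differ from the input, yet re-tokenizing it should give the same pairs:

```python
import io
import tokenize


def type_string_pairs(source):
    """Tokenize `source` and keep only (token type, token string) for each token."""
    readline = io.StringIO(source).readline
    return [(tok.type, tok.string) for tok in tokenize.generate_tokens(readline)]


# Illustrative tab-indented input.
source = "for i in range(10):\n\tif i == 5:\n\t\tprint('bingo')\n"

original = type_string_pairs(source)

# 2-tuples carry no column positions, so untokenize chooses its own spacing.
rebuilt = tokenize.untokenize(original)
round_tripped = type_string_pairs(rebuilt)

print(repr(rebuilt))              # spacing may differ from `source`
print(original == round_tripped)  # token types and strings should still match
```

So differing column positions in the rebuilt text are not, by themselves, a violation of what the documentation promises.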

Upvotes: 1
