Strange behaviour of the tokenize module

Question

I see some strange behaviour when using the tokenize module:

import tokenize, io, sys
from contextlib import closing
source=u"""
for i in range(10):
    print "HELLO"
    if i==5:
        print "bingo"
"""
def parse():
    with closing(io.StringIO(source)) as f:
        for type, token, (srow, scol), (erow, ecol), line in tokenize.generate_tokens(f.readline):
            print("%d,%d-%d,%d:	%s	%s" % (srow, scol, erow, ecol, tokenize.tok_name[type], repr(token)))
            yield type, token, (srow, scol), (erow, ecol), line

for token in tokenize.untokenize(parse()):
    sys.stdout.write(token)

The INDENT before the if is missing from the console output. This happens in both Python 2 and Python 3. Is this a known bug, or am I using the module incorrectly?

3,1-3,6:    NAME    u'print'
3,7-3,14:   STRING  u'"HELLO"'
3,14-3,15:  NEWLINE u'\n'
4,1-4,3:    NAME    u'if'
4,4-4,5:    NAME    u'i'
4,5-4,7:    OP  u'=='

I use tabs for indentation. When I replace each tab with 4 spaces, I get the correct result:

3,4-3,9:    NAME    u'print'
3,10-3,17:  STRING  u'"HELLO"'
3,17-3,18:  NEWLINE u'\n'
4,4-4,6:    NAME    u'if'
4,7-4,8:    NAME    u'i'
4,8-4,10:   OP  u'=='
4,10-4,11:  NUMBER  u'5'
4,11-4,12:  OP  u':'

The difference is the start column of the if token: it is 1 when using tabs and 4 when using spaces. This looks like a bug in the 'untokenize' function, which appears to output spaces instead of tabs.
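A smaller reproduction of the column difference, as a sketch for Python 3. The helper name first_name_column and the two sample sources are made up for illustration and are not part of the tokenize module. It only shows the start column the tokenizer reports for the first token after the indentation. The round-trip comparison at the end is printed without an expected result, since the behaviour of untokenize with tabs may depend on the Python version.

```python
import io
import tokenize


def first_name_column(source, name):
    """Return the start column of the first NAME token equal to `name`.

    Hypothetical helper, written only for this illustration.
    """
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and tok.string == name:
            return tok.start[1]
    return None


# Same program, indented once with a tab and once with four spaces.
TAB_SOURCE = "for i in range(3):\n\tprint(i)\n"
SPACE_SOURCE = "for i in range(3):\n    print(i)\n"

tab_col = first_name_column(TAB_SOURCE, "print")
space_col = first_name_column(SPACE_SOURCE, "print")
print("start column with tab:   ", tab_col)
print("start column with spaces:", space_col)

# Round trip through untokenize; whether the tab survives may vary by version.
tokens = list(tokenize.generate_tokens(io.StringIO(TAB_SOURCE).readline))
print("tab round trip intact:", tokenize.untokenize(tokens) == TAB_SOURCE)
```

The columns are character offsets within the line, so a single tab moves the next token to column 1, while four spaces move it to column 4. Any code that rebuilds the line from the column numbers alone cannot tell that the one character of indentation was a tab.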

Can someone confirm?
