Python tokenize: token position

Question

Python's tokenize returns each token's position as two tuples: (startRow, startCol) and (endRow, endCol).

Is there a way to return the positions as the offsets from the beginning of the string? That is, I would like to get rid of (row, col) in favor of just "offset".

Amber · Accepted Answer

There isn't one built into tokenize.

If you have access to the same set of lines the tokenizer is reading, you can run through them and store the accumulated "total length of lines before line X" in a list, then use that list to convert each (row, col) pair into an absolute offset. Keep in mind that tokenize reports rows starting at 1 and columns starting at 0.

For instance:

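import tokenize

def tokens_with_offset(path):
    # line_offsets[i] is the offset at which line i + 1 starts.
    line_offsets = []
    line_offset_accum = 0
    with open(path) as f:
        for line in f:
            line_offsets.append(line_offset_accum)
            line_offset_accum += len(line)
    # ENDMARKER is reported one row past the last line, so add a final entry.
    line_offsets.append(line_offset_accum)

    with open(path) as f:
        for ttype, tstring, tbegin, tend, tline in tokenize.generate_tokens(f.readline):
            # tokenize rows are 1-based; columns are 0-based.
            offset_begin = line_offsets[tbegin[0] - 1] + tbegin[1]
            offset_end = line_offsets[tend[0] - 1] + tend[1]
            yield ttype, tstring, offset_begin, offset_end, tline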

(Note: haven't tested this code, it's more as an example of the general concept.)
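Since the question asks about offsets into a string rather than a file, here is a minimal sketch of the same idea for an in-memory string. The function name tokens_with_offsets and the use of io.StringIO to feed the tokenizer are my own choices for illustration, not something tokenize provides. It assumes the source uses plain "\n" line endings.

```python
import io
import tokenize


def tokens_with_offsets(source):
    """Yield (type, string, start_offset, end_offset) for each token in source."""
    # line_starts[i] is the offset of the first character of line i + 1.
    line_starts = []
    total = 0
    for line in io.StringIO(source):
        line_starts.append(total)
        total += len(line)
    # ENDMARKER is reported one row past the last line, so add a final entry.
    line_starts.append(total)

    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        (start_row, start_col), (end_row, end_col) = tok.start, tok.end
        # Rows are 1-based and columns are 0-based.
        start = line_starts[start_row - 1] + start_col
        end = line_starts[end_row - 1] + end_col
        yield tok.type, tok.string, start, end
```

A quick sanity check is that source[start:end] should give back the token's text. Because the start and end rows are converted separately, this also covers tokens that span several lines, such as triple-quoted strings.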
