Reputation: 11468
Python's tokenize module reports each token's position as two tuples: (startRow, startCol) and (endRow, endCol).
Is there a way to return the positions as the offsets from the beginning of the string? That is, I would like to get rid of (row, col) in favor of just "offset".
Upvotes: 2
Views: 1410
Reputation: 526743
There isn't one built into tokenize.
If you had access to the same set of lines being used by the tokenizer, you could run through and store the accumulated "total length of lines before line X" into a list, and then use that to convert the row values into additive offsets.
For instance:
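import tokenize

def tokens_with_offset(path):
    # line_offsets[i] is the offset of the first character of line i + 1.
    line_offsets = []
    line_offset_accum = 0
    with open(path) as f:
        for line in f:
            line_offsets.append(line_offset_accum)
            line_offset_accum += len(line)
    # ENDMARKER (and any trailing DEDENT) sits on the row after the last line.
    line_offsets.append(line_offset_accum)
    with open(path) as f:
        for ttype, tstring, tbegin, tend, tline in tokenize.generate_tokens(f.readline):
            # tokenize rows are 1-based, so subtract 1 to index the list.
            offset_begin = line_offsets[tbegin[0] - 1] + tbegin[1]
            offset_end = line_offsets[tend[0] - 1] + tend[1]
            yield ttype, tstring, offset_begin, offset_end, tline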
(Note: I haven't tested this code; it's meant as an example of the general concept.)
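Since the question asks about offsets into a string rather than a file, the same idea can be sketched for in-memory source. This is only a rough sketch under a few assumptions: the name tokens_with_string_offsets is hypothetical (it is not part of tokenize), the source is a str that tokenizes without errors, and the line starts are computed by reading the string through io.StringIO so that they match the lines the tokenizer itself sees.

```python
import io
import tokenize


def tokens_with_string_offsets(source):
    # Hypothetical helper, not part of tokenize.
    # line_offsets[i] is the offset of the first character of line i + 1;
    # the final entry is len(source), used by ENDMARKER and trailing DEDENTs.
    line_offsets = [0]
    for line in io.StringIO(source):
        line_offsets.append(line_offsets[-1] + len(line))

    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # tokenize rows are 1-based; columns are 0-based.
        start = line_offsets[tok.start[0] - 1] + tok.start[1]
        end = line_offsets[tok.end[0] - 1] + tok.end[1]
        yield tok.type, tok.string, start, end


if __name__ == "__main__":
    src = "x = 1\ny = x + 2\n"
    for ttype, text, start, end in tokens_with_string_offsets(src):
        print(tokenize.tok_name[ttype], repr(text), start, end)
```

For source that ends with a newline, slicing source[start:end] with the yielded offsets should give back each token's text, which is a quick way to sanity-check the conversion.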
Upvotes: 1