xorsyst

Reputation: 8237

How to split a string on whitespace and retain offsets and lengths of words

I need to split a string into words, but also get the starting and ending offset of the words. So, for example, if the input string is:

input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

I want to get:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]

I've got some working code that does this using input_string.split and calls to .index, but it's slow. I tried to code it by manually iterating through the string, but that was slower still. Does anyone have a fast algorithm for this?

Here are my two versions:

def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))

    return offsets

def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char

    return offsets

Using timeit, "using_split" is the fastest, followed by "manual_iteration"; the slowest so far is the re.finditer approach suggested below.
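A sketch of the kind of harness I mean (the module name word_offsets is a placeholder; it just needs to contain the two functions and input_string above):

# Hypothetical timing harness; "word_offsets" is a placeholder module
# holding the two functions and input_string defined above.
import timeit

setup = "from word_offsets import using_split, manual_iteration, input_string"
print(timeit.timeit("using_split(input_string)", setup=setup, number=100000))
print(timeit.timeit("manual_iteration(input_string)", setup=setup, number=100000))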

Upvotes: 15

Views: 5930

Answers (10)

Dima Tisnek

Reputation: 11781

It occurs to me that the Python loop is the slow operation here, so I started on bitmaps. I got this far, and it's still fast, but I can't figure out a loop-free way to get the start/stop indices out of it:

import string

# 256-entry translation table: whitespace maps to "0", everything else to "1"
table = "".join(["0" if chr(i).isspace() else "1" for i in range(256)])

def indexed6(line):
    binline = string.translate(line, table)  # Python 2; e.g. "111001110111..."
    # XOR of the bitmap with itself shifted one place marks every transition
    # between a whitespace run and a non-whitespace run
    return int(binline, 2) ^ int(binline + "0", 2)

The returned integer has bits set for each start position and each stop+1 position.
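A quick sanity check of that claim, though it falls back on a Python loop (a throwaway snippet, which of course defeats the loop-free goal; it assumes the Python 2 indexed6 above):

line = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
mask = indexed6(line)
# bit (len(line) - i) of the mask is set exactly when position i is a
# word start or a stop+1 boundary
boundaries = [i for i in range(len(line) + 1)
              if (mask >> (len(line) - i)) & 1]
print(list(zip(boundaries[::2], boundaries[1::2])))
# -> [(0, 3), (5, 8), (9, 12), (17, 20), (21, 24),
#     (25, 28), (29, 32), (33, 36), (37, 42)]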

P.S. zip() is relatively slow: fast enough to be used once, too slow to be used 3 times.

Upvotes: 0

stevegt

Reputation: 1742

I was able to get about a 35% speedup in a few minutes by outright cheating: I converted your using_split() function into a C-based Python module using Cython. This was the first excuse I've ever had to try Cython, and I found it pretty easy and rewarding -- see below.

Punting into C was a last resort: first I spent a few hours messing around trying to find a faster algorithm than your using_split() version. The thing is, the native Python str.split() is surprisingly fast, faster than anything I tried using numpy or re, for instance. So even though you're scanning the string twice, str.split() is fast enough that it doesn't seem to matter, at least not for this particular test data.

To use cython, I put your parser in a file named parser.pyx:

===================== parser.pyx ==============================
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets
===============================================================

Then I ran this to install Cython (assuming a Debian-ish Linux box):

sudo apt-get install cython

Then I called the parser from this python script:

================== using_cython.py ============================

#!/usr/bin/python

import pyximport; pyximport.install()
import parser

input_string = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"

def parse():
    return parser.using_split(input_string)

===============================================================

To test, I ran this:

python -m timeit "import using_cython; using_cython.parse();"

On my machine, your pure-Python using_split() function averages around 8.5 usec per run, while my Cython version averages around 5.5 usec.
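A plausible next step, which I haven't benchmarked, would be to add static types in the .pyx file; that's usually where Cython pays off most. A sketch:

================== parser_typed.pyx ===========================
# Untested sketch: typing the offsets lets Cython do the index
# arithmetic in C instead of on Python ints.
def using_split_typed(line):
    cdef Py_ssize_t running_offset, word_offset
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets
===============================================================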

More details at http://docs.cython.org/src/userguide/source_files_and_compilation.html

Upvotes: 2

Rusty Rob

Reputation: 17173

Here are a couple of ideas which you could profile to see if they are fast enough:

input_string = "".join([" ","ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"," "])

#pre processing
from itertools import chain
stuff = list(chain(*zip(range(len(input_string)),range(len(input_string)))))
print stuff
stuff = iter(stuff)
next(stuff)

#calculate
switches = (i for i in range(0,len(input_string)-1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
print [(word,next(switches),next(switches)-1) for word in input_string.split()]


#pre processing
from itertools import chain
stuff = list(chain(*zip(range(len(input_string)),range(len(input_string)))))
print stuff
stuff = iter(stuff)
next(stuff)


#calculate
switches = (i for i in range(0,len(input_string)-1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
print [(input_string[i:j+1],i,j-1) for i,j in zip(switches,switches)]

Upvotes: 0

Rusty Rob

Reputation: 17173

Warning, the speed of this solution is limited by the speed of light:

def get_word_context(input_string):
    start = 0
    for word in input_string.split():
        c = word[0]  # first character is enough: only whitespace can sit
                     # between start and the next word, so find() lands on it
        start = input_string.find(c, start)
        end = start + len(word) - 1
        yield (word, start, end)
        start = end + 2  # at least one separator must follow the word

print list(get_word_context("ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"))

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23), ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]

Upvotes: 3

Joel Cornett

Reputation: 24788

This seems to work pretty quickly:

import re

tuple_list = [(match.group(), match.start(), match.end() - 1)  # end() is exclusive
              for match in re.compile(r"\S+").finditer(input_string)]

Upvotes: 1

Markus

Reputation: 1636

Here is a C-oriented approach that iterates over the complete string only once. You can also define your own separators. The flush after the loop matters: without it, a word running to the end of the string would be dropped.

def mySplit(myString, mySeparators):
    w = []
    iW = False                 # currently inside a word?
    word = [None, None, None]  # [text, start, end]
    for i, c in enumerate(myString):
        if c not in mySeparators:
            if not iW:         # a new word starts at i
                word[1] = i
                iW = True
        elif iW:               # the word ended at i - 1
            word[2] = i - 1
            word[0] = myString[word[1]:i]
            w.append(tuple(word))
            word = [None, None, None]
            iW = False
    if iW:                     # flush a word that runs to the end of the string
        word[2] = len(myString) - 1
        word[0] = myString[word[1]:]
        w.append(tuple(word))
    return w

mySeparators = [" ", "\t"]
myString = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
splitted = mySplit(myString, mySeparators)
print splitted
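With the final flush in place, this prints the same tuples the question asks for:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]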

Upvotes: 0

Iain Rist

Reputation: 996

The following ideas may result in speed-ups:

  1. Use a deque rather than a list to store the offsets, converting to a list only on return. A deque append never triggers the occasional reallocate-and-copy that a list append does.
  2. If the words are known to be shorter than some length, pass that length as a bound to the index call so it only searches a small window.
  3. Bind the methods used in the loop to local names, avoiding repeated attribute lookups.

Note: I haven't tested these, but here is an example:

from collections import deque

def using_split(line):
    MAX_WORD_LENGTH = 10  # the search window must cover the whitespace gap
                          # plus the word, or index() raises ValueError
    line_index = line.index

    words = line.split()

    offsets = deque()
    offsets_append = offsets.append

    running_offset = 0

    for word in words:
        word_offset = line_index(word, running_offset, running_offset+MAX_WORD_LENGTH)
        running_offset = word_offset + len(word)
        offsets_append((word, word_offset, running_offset - 1))

    return list(offsets)
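One caveat with idea 2 (my example, using the version above): if the whitespace gap before a word plus the word itself exceeds MAX_WORD_LENGTH, the bounded index() search raises ValueError:

line = "A" + " " * 12 + "B"   # the 12-space gap exceeds MAX_WORD_LENGTH = 10
try:
    using_split(line)
except ValueError as e:
    print("index() gave up: %s" % e)   # substring not found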

Upvotes: 0

aquavitae

Reputation: 19114

The following runs slightly faster, saving about 30%. All I did was bind the functions to local names in advance:

def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets

Upvotes: 10

NPE

Reputation: 500317

The following will do it:

import re
s = 'ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
print(ret)

This produces:

[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
 ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]

Upvotes: 25

Fred Foo

Reputation: 363547

import re

def split_span(s):
    for match in re.finditer(r"\S+", s):
        span = match.span()
        yield match.group(0), span[0], span[1] - 1
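Since split_span() is a generator, materialize it to see the tuples:

s = "ONE  ONE ONE   \t TWO TWO ONE TWO TWO THREE"
print(list(split_span(s)))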

Upvotes: 7
