Reputation: 8237
I need to split a string into words, but also get the starting and ending offset of the words. So, for example, if the input string is:
input_string = "ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"
I want to get:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
I've got some working code that does this using input_string.split and calls to .index, but it's slow. I tried to code it by manually iterating through the string, but that was slower still. Does anyone have a fast algorithm for this?
Here are my two versions:
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets
def manual_iteration(line):
    start = 0
    offsets = []
    word = ''
    for off, char in enumerate(line + ' '):
        if char in ' \t\r\n':
            if off > start:
                offsets.append((word, start, off - 1))
            start = off + 1
            word = ''
        else:
            word += char
    return offsets
Using timeit, "using_split" is the fastest, followed by "manual_iteration"; the slowest so far is the re.finditer approach suggested below.
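For anyone benchmarking alternatives, a quick sanity check (a sketch; the function is copied from the question above) that an implementation produces the expected inclusive-end tuples:

```python
def using_split(line):
    # reference implementation from the question
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        running_offset = word_offset + len(word)
        offsets.append((word, word_offset, running_offset - 1))
    return offsets

# small input with mixed separators; offsets are inclusive on both ends
assert using_split("AB  C\tDE") == [("AB", 0, 1), ("C", 4, 4), ("DE", 6, 7)]
```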
Upvotes: 15
Views: 5930
Reputation: 11781
It occurs to me that the Python loop is the slow operation here, so I started on bitmaps. I got this far, and it's still fast, but I can't figure out a loop-free way to get start/stop indices out of it:
import string
table = "".join([chr(i).isspace() and "0" or "1" for i in range(256)])

def indexed6(line):
    binline = string.translate(line, table)
    return int(binline, 2) ^ int(binline + "0", 2)
The returned integer has bits set for each start position and each stop+1 position.
P.S. zip() is relatively slow: fast enough to be used once, too slow to be used 3 times.
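For reference, the bits can be decoded back into offsets like this (a sketch in Python 3 syntax; it is still loop-based, not the loop-free decode the answer is after, but it shows how bit positions map back to string indices):

```python
def bitmap_offsets(line):
    # Python 3 rewrite of the translate step: '1' for non-space, '0' for space.
    # Assumes a non-empty line (int("", 2) would raise ValueError).
    binline = "".join("0" if c.isspace() else "1" for c in line)
    x = int(binline, 2)
    mask = x ^ (x << 1)  # same trick as int(binline, 2) ^ int(binline + "0", 2)
    n = len(line)
    # A set bit at position p (counted from the right) marks string index n - p;
    # the marked indices alternate start, stop+1, start, stop+1, ...
    marks = sorted(n - p for p in range(n + 1) if mask >> p & 1)
    it = iter(marks)
    return [(line[a:b], a, b - 1) for a, b in zip(it, it)]
```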
Upvotes: 0
Reputation: 1742
I was able to get about a 35% speedup in a few minutes by outright cheating: I converted your using_split() function into a C-based python module using cython. This was the first excuse I've ever had to try cython, and I see that it's pretty easy and rewarding -- see below.
Punting into C was a last resort: First I spent a few hours messing around trying to find a faster algorithm than your using_split() version. The thing is, the native python str.split() is surprisingly fast, faster than anything I tried using numpy or re, for instance. So even though you're scanning the string twice, str.split() is fast enough that it doesn't seem to matter, at least not for this particular test data.
To use cython, I put your parser in a file named parser.pyx:
===================== parser.pyx ==============================
def using_split(line):
    words = line.split()
    offsets = []
    running_offset = 0
    for word in words:
        word_offset = line.index(word, running_offset)
        word_len = len(word)
        running_offset = word_offset + word_len
        offsets.append((word, word_offset, running_offset - 1))
    return offsets
===============================================================
Then I ran this to install cython (assuming a debian-ish Linux box):
sudo apt-get install cython
Then I called the parser from this python script:
================== using_cython.py ============================
#!/usr/bin/python
import pyximport; pyximport.install()
import parser
input_string = "ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"
def parse():
    return parser.using_split(input_string)
===============================================================
To test, I ran this:
python -m timeit "import using_cython; using_cython.parse();"
On my machine, your pure-python using_split() function averages around 8.5 usec runtime, while my cython version averages around 5.5 usec.
More details at http://docs.cython.org/src/userguide/source_files_and_compilation.html
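The pure-Python side of the comparison can be reproduced without cython; a sketch of the timeit harness (absolute numbers will of course differ per machine):

```python
import timeit

# define the test line in the timeit setup so it isn't counted in the timing
setup = 'line = "ONE ONE ONE \\t TWO TWO ONE TWO TWO THREE"'

# the split-based parser, inlined as the timed statement
stmt = """
offsets = []
running = 0
for word in line.split():
    off = line.index(word, running)
    running = off + len(word)
    offsets.append((word, off, running - 1))
"""
t = timeit.timeit(stmt, setup=setup, number=10000)
print("%.3f usec per loop" % (t / 10000 * 1e6))
```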
Upvotes: 2
Reputation: 17173
Here are a couple of ideas which you could profile to see if they are fast enough:
input_string = "".join([" ","ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"," "])
#pre processing
from itertools import chain
stuff = list(chain(*zip(range(len(input_string)),range(len(input_string)))))
print stuff
stuff = iter(stuff)
next(stuff)
#calculate
switches = (i for i in range(0,len(input_string)-1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
print [(word,next(switches),next(switches)-1) for word in input_string.split()]
#pre processing
from itertools import chain
stuff = list(chain(*zip(range(len(input_string)),range(len(input_string)))))
print stuff
stuff = iter(stuff)
next(stuff)
#calculate
switches = (i for i in range(0,len(input_string)-1) if (input_string[next(stuff)] in " \t\r\n") ^ (input_string[next(stuff)] in " \t\r\n"))
print [(input_string[i:j+1],i,j-1) for i,j in zip(switches,switches)]
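Both snippets rely on the same observation: a word boundary sits wherever the whitespace-ness of adjacent characters differs. That idea can be written more directly (a sketch in Python 3 syntax, padding with spaces so words at either edge get boundaries too):

```python
def boundary_offsets(line):
    padded = " " + line + " "
    ws = [c in " \t\r\n" for c in padded]
    # A flip between padded positions i and i+1 marks a boundary; the stored
    # values alternate word-start, then (word-end + 1), in line coordinates.
    switches = [i for i in range(len(padded) - 1) if ws[i] ^ ws[i + 1]]
    it = iter(switches)
    return [(line[s:e], s, e - 1) for s, e in zip(it, it)]
```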
Upvotes: 0
Reputation: 17173
Warning, the speed of this solution is limited by the speed of light:
def get_word_context(input_string):
    start = 0
    for word in input_string.split():
        c = word[0]  # first character
        start = input_string.find(c, start)
        end = start + len(word) - 1
        yield (word, start, end)
        start = end + 2
print list(get_word_context("ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"))
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23), ('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
Upvotes: 3
Reputation: 24788
This seems to work pretty quickly:
import re
tuple_list = [(match.group(), match.start(), match.end() - 1) for match in re.compile(r"\S+").finditer(input_string)]
Upvotes: 1
Reputation: 1636
Here you have a C-oriented approach that only iterates once over the complete string. You can also define your own separators. Tested and works, but could probably be cleaner.
def mySplit(myString, mySeperators):
    w = []
    iW = False
    word = [None, None, None]
    # iterate over the string plus one trailing separator so the final
    # word is flushed even when the string does not end with a separator
    for i, c in enumerate(myString + mySeperators[0]):
        if c not in mySeperators:
            if not iW:
                word[1] = i
                iW = True
        if iW and c in mySeperators:
            word[2] = i - 1
            word[0] = myString[word[1]:i]
            w.append(tuple(word))
            word = [None, None, None]
            iW = False
    return w

mySeperators = [" ", "\t"]
myString = "ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"
splitted = mySplit(myString, mySeperators)
print splitted
Upvotes: 0
Reputation: 996
The following ideas may result in speed-ups: bind methods to local names outside the loop, collect results in a deque, and give line.index an upper search bound so it never scans far past the current word.
Note: I haven't tested these, but here is an example:
from collections import deque

def using_split(line):
    MAX_WORD_LENGTH = 10
    line_index = line.index
    words = line.split()
    offsets = deque()
    offsets_append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = line_index(word, running_offset, running_offset + MAX_WORD_LENGTH)
        running_offset = word_offset + len(word)
        offsets_append((word, word_offset, running_offset - 1))
    return list(offsets)
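One caveat worth flagging: the bounded search assumes each inter-word gap plus the next word fits within MAX_WORD_LENGTH characters of the previous word's end, otherwise line.index raises ValueError. A quick check of the variant on a small input (sketch; function copied from above, renamed to avoid clashing with the question's version):

```python
from collections import deque

def using_split_bounded(line, MAX_WORD_LENGTH=10):
    line_index = line.index
    offsets = deque()
    offsets_append = offsets.append
    running_offset = 0
    for word in line.split():
        # the search window keeps index() from scanning the whole string,
        # but it must still contain the next word in full
        word_offset = line_index(word, running_offset, running_offset + MAX_WORD_LENGTH)
        running_offset = word_offset + len(word)
        offsets_append((word, word_offset, running_offset - 1))
    return list(offsets)

assert using_split_bounded("AB  C\tDE") == [("AB", 0, 1), ("C", 4, 4), ("DE", 6, 7)]
```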
Upvotes: 0
Reputation: 19114
The following runs slightly faster - it saves about 30%. All I did was bind the functions and methods to local names in advance:
def using_split2(line, _len=len):
    words = line.split()
    index = line.index
    offsets = []
    append = offsets.append
    running_offset = 0
    for word in words:
        word_offset = index(word, running_offset)
        word_len = _len(word)
        running_offset = word_offset + word_len
        append((word, word_offset, running_offset - 1))
    return offsets
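The ~30% comes from skipping repeated attribute and global lookups inside the loop; a minimal demonstration of the trick in isolation (a sketch; both variants return the same result):

```python
line = "ONE ONE ONE \t TWO TWO ONE TWO TWO THREE"

def lookup_each_time():
    out = []
    for w in line.split():
        out.append(w)  # out.append is resolved on every iteration
    return out

def hoisted():
    out = []
    append = out.append  # bound method resolved once, before the loop
    for w in line.split():
        append(w)
    return out

assert lookup_each_time() == hoisted()
```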
Upvotes: 10
Reputation: 500317
The following will do it:
import re
s = 'ONE ONE ONE \t TWO TWO ONE TWO TWO THREE'
ret = [(m.group(0), m.start(), m.end() - 1) for m in re.finditer(r'\S+', s)]
print(ret)
This produces:
[('ONE', 0, 2), ('ONE', 5, 7), ('ONE', 9, 11), ('TWO', 17, 19), ('TWO', 21, 23),
('ONE', 25, 27), ('TWO', 29, 31), ('TWO', 33, 35), ('THREE', 37, 41)]
Upvotes: 25
Reputation: 363547
import re

def split_span(s):
    for match in re.finditer(r"\S+", s):
        span = match.span()
        yield match.group(0), span[0], span[1] - 1
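A quick way to materialize the generator, reproduced here so the snippet runs standalone (end offsets are inclusive, as in the question):

```python
import re

def split_span(s):
    # generator from the answer above
    for match in re.finditer(r"\S+", s):
        a, b = match.span()
        yield match.group(0), a, b - 1

result = list(split_span("AB  C\tDE"))
```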
Upvotes: 7