Reputation: 95
Suppose I have a string such as
"Let's split this string into many small ones"
and I want to split it on this, into and ones such that the output looks something like this:
["Let's split", "this string", "into many small", "ones"]
What is the most efficient way to do it?
Upvotes: 3
Views: 2879
Reputation: 309909
Here's a fairly lazy way to do it:
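import re

def resplit(regex, s):
    current = None  # None as a slice start means "from the beginning"
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]  # piece ending where this match begins
        current = start
    yield s[current:]  # remainder; also works when nothing matched

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))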
I don't know for sure if this is the most efficient, but it's pretty clean.
Basically, we just iterate through the matches, taking one piece at a time. The pieces are determined by the index in the string (s) where the regex starts to match. We chop the string up to that point and save that index as the start of the next slice.
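To make the slicing concrete, here is a minimal self-contained sketch, written for Python 3 (an assumption; the code in this answer targets Python 2). The final slice starts at current, so a string with no matches comes back whole. The strip() step is an addition for illustration, since each slice keeps the whitespace that precedes the next match.

```python
import re

def resplit(regex, s):
    """Yield slices of s, each one starting where a match of regex begins."""
    current = None  # None as a slice start means "from the beginning"
    for match in regex.finditer(s):
        start = match.start()
        yield s[current:start]
        current = start
    yield s[current:]  # tail after the last match, or all of s if none matched

s = "Let's split this string into many small ones"
regex = re.compile(r'this|into|ones')

raw = list(resplit(regex, s))
pieces = [piece.strip() for piece in raw]  # illustrative cleanup step
print(raw)     # slices as produced, trailing spaces included
print(pieces)  # slices with surrounding whitespace removed

# Edge case: no match at all returns the input as a single piece.
unmatched = list(resplit(regex, "nothing to see here"))
print(unmatched)
```

If the string begins with one of the words, the first piece is an empty string, which can be filtered out in the same comprehension.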
As for performance (total seconds for timeit's default of 1,000,000 runs), ignacio clearly wins this round:
9.1412050724 -- Me
3.09771895409 -- ignacio
Code:
import re
import timeit

def resplit(regex, s):
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    yield s[current:]  # remainder; also works when nothing matched

def me(regex, s):
    return list(resplit(regex, s))

def ignacio(regex, s):
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

print(timeit.timeit("me(regex,s)", "from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)", "from __main__ import ignacio,regex2,s"))
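One caveat when reading these numbers: the two functions do not return identical lists, because the finditer-based pieces keep the whitespace before each match. The following is a rough Python 3 sketch that checks this and times both approaches by passing callables to timeit instead of import strings. The run count n is an arbitrary assumption, and no timings are shown because they depend on the machine.

```python
import re
import timeit

def resplit(regex, s):
    current = None
    for match in regex.finditer(s):
        yield s[current:match.start()]
        current = match.start()
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile(r'this|into|ones')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

via_finditer = list(resplit(regex, s))
via_split = regex2.split(s)

# The results agree only once the trailing whitespace is stripped.
same_after_strip = [piece.strip() for piece in via_finditer] == via_split

n = 10000  # assumed run count, far below timeit's default of 1,000,000
t_finditer = timeit.timeit(lambda: list(resplit(regex, s)), number=n)
t_split = timeit.timeit(lambda: regex2.split(s), number=n)
print(same_after_strip, t_finditer, t_split)
```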
Upvotes: 1
Reputation: 1121834
By using re.split():
>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']
By putting the words to split on in a capturing group, the output includes the words we split on.
If you need the spaces removed, use map(str.strip, result) on the re.split() output:
>>> map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']
and you could use filter(None, result) to remove any empty strings if need be:
>>> filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']
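On Python 3, map() and filter() return lazy iterators rather than lists, so the calls above would need wrapping in list() to display the same way. A minimal sketch of an equivalent using a list comprehension (the names parts and cleaned are illustrative only):

```python
import re

s = "Let's split this string into many small ones"

# The capturing group keeps the words we split on in the result.
parts = re.split(r'(this|into|ones)', s)

# Strip each piece and drop the pieces that are empty after stripping.
cleaned = [part.strip() for part in parts if part.strip()]
print(cleaned)
```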
To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
Now we are really splitting on whitespace, but only on whitespace that is followed by a whole word, one in the set of this, into and ones.
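If the words to split before are only known at run time, the same lookahead pattern can be assembled from a list. This is a sketch under that assumption; split_before_words is a hypothetical helper name, not something from the re module.

```python
import re

def split_before_words(text, words):
    """Split text at whitespace that directly precedes a whole word in words."""
    # re.escape guards against regex metacharacters inside the words.
    alternation = '|'.join(re.escape(word) for word in words)
    pattern = re.compile(r'\s(?=(?:%s)\b)' % alternation)
    return pattern.split(text)

s = "Let's split this string into many small ones"
result = split_before_words(s, ['this', 'into', 'ones'])
print(result)

# The \b keeps "ones" from matching inside "onesie", so nothing is split here.
no_split = split_before_words("small onesie", ['ones'])
print(no_split)
```

Because the lookahead consumes nothing, only the single whitespace character is removed at each split point.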
Upvotes: 3
Reputation: 798626
With a lookahead.
>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']
Upvotes: 11