cHaTrU

Reputation: 95

Split Python string on multiple string delimiters efficiently

Suppose I have a string such as "Let's split this string into many small ones" and I want to split it on the words "this", "into" and "ones", so that the output looks something like this:

["Let's split", "this string", "into many small", "ones"]

What is the most efficient way to do it?

Upvotes: 3

Views: 2879

Answers (3)

mgilson

Reputation: 309909

Here's a fairly lazy way to do it:

import re

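def resplit(regex, s):
    # Yield the slices of s that lie between the starts of successive matches.
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Use current, not start: start is undefined when there are no matches.
    yield s[current:]

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
print(list(resplit(regex, s)))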

I don't know for sure if this is the most efficient, but it's pretty clean.

Basically, we iterate through the matches, taking one piece at a time. Each piece is determined by the index in the string (s) where the regex starts to match: we slice the string up to that point and save the index as the start of the next slice. Note that each piece except the last keeps its trailing space, so strip the pieces if you need exactly the output in the question.
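To make the slicing concrete, here is a small sketch that records the offset at which each piece starts. The helper name trace_slices is made up for illustration, and the code is written for Python 3.

```python
import re

def trace_slices(regex, s):
    """Return (start_offset, piece) pairs; trace_slices is an illustrative name."""
    pieces = []
    current = 0
    for match in regex.finditer(s):
        start = match.start()
        # Everything from the previous match start up to this one is a piece.
        pieces.append((current, s[current:start]))
        current = start
    # The tail of the string after the last match start.
    pieces.append((current, s[current:]))
    return pieces

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
for offset, piece in trace_slices(regex, s):
    print(offset, repr(piece))
```

The matches start at offsets 12, 24 and 40, so the pieces begin at 0, 12, 24 and 40.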


As for performance, ignacio clearly wins this round (times are in seconds for timeit's default of 1,000,000 calls):

9.1412050724  -- Me
3.09771895409  -- ignacio

Code:

import re
import timeit

def resplit(regex, s):
    current = None
    for x in regex.finditer(s):
        start = x.start()
        yield s[current:start]
        current = start
    # Use current, not start: start is undefined when there are no matches.
    yield s[current:]


def me(regex, s):
    return list(resplit(regex, s))

def ignacio(regex, s):
    # Split the string that was passed in rather than a hard-coded copy.
    return regex.split(s)

s = "Let's split this string into many small ones"
regex = re.compile('(this|into|ones)')
regex2 = re.compile(r'\s(?=(?:this|into|ones)\b)')

print(timeit.timeit("me(regex,s)", "from __main__ import me,regex,s"))
print(timeit.timeit("ignacio(regex2,s)", "from __main__ import ignacio,regex2,s"))
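A single timeit run can be noisy, so one option is to repeat the measurement and keep the best time. This is only a sketch for Python 3, where timeit accepts a callable; the function name split_lookahead and the repeat and number values are arbitrary choices, and no timings are claimed here.

```python
import re
import timeit

s = "Let's split this string into many small ones"
lookahead = re.compile(r'\s(?=(?:this|into|ones)\b)')

def split_lookahead():
    # The lookahead split from ignacio's answer, on a precompiled pattern.
    return lookahead.split(s)

# Three runs of 10000 calls each; the minimum is the least disturbed run.
timings = timeit.repeat(split_lookahead, repeat=3, number=10000)
best = min(timings)
print("best of 3: %.4f s per 10000 calls" % best)
```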

Upvotes: 1

Martijn Pieters

Reputation: 1121834

By using re.split():

>>> import re
>>> re.split(r'(this|into|ones)', "Let's split this string into many small ones")
["Let's split ", 'this', ' string ', 'into', ' many small ', 'ones', '']

By putting the words to split on in a capturing group, the output includes the words we split on.
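For comparison, a quick sketch of the same split with a capturing and a non-capturing group (written for Python 3; the variable names are only illustrative):

```python
import re

s = "Let's split this string into many small ones"

# Capturing group: the delimiters appear in the result.
with_delims = re.split(r'(this|into|ones)', s)

# Non-capturing group: the delimiters are consumed and dropped.
without_delims = re.split(r'(?:this|into|ones)', s)

print(with_delims)
print(without_delims)
```

With the non-capturing group only the text between the delimiters survives, which is why the capturing form is used here.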

If you need the spaces removed, use map(str.strip, result) on the re.split() output (wrapped in list() on Python 3, where map() returns an iterator):

>>> list(map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones")))
["Let's split", 'this', 'string', 'into', 'many small', 'ones', '']

and you could use filter(None, result) to remove any empty strings if need be:

>>> list(filter(None, map(str.strip, re.split(r'(this|into|ones)', "Let's split this string into many small ones"))))
["Let's split", 'this', 'string', 'into', 'many small', 'ones']

To split on words but keep them attached to the following group, you need to use a lookahead assertion instead:

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

Now we are really splitting on whitespace, but only on whitespace that is followed by one of the whole words this, into or ones. The \b word boundary stops a longer word that merely starts with one of them from counting as a match.
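If the delimiter words come from a list, the same pattern can be assembled at run time. The following is a minimal sketch for Python 3; split_before_words is a hypothetical helper, not part of the re module, and it assumes every delimiter ends in a word character so that \b behaves as described.

```python
import re

def split_before_words(text, words):
    """Split on whitespace that precedes any whole word in `words`.

    split_before_words is a hypothetical helper, not part of re.
    """
    # re.escape guards against delimiters containing regex metacharacters.
    alternation = '|'.join(re.escape(w) for w in words)
    pattern = re.compile(r'\s(?=(?:%s)\b)' % alternation)
    return pattern.split(text)

print(split_before_words("Let's split this string into many small ones",
                         ['this', 'into', 'ones']))
# "thistle" starts with "this" but is not split, because of the \b.
print(split_before_words("pull the thistle into view", ['this', 'into']))
```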

Upvotes: 3

Ignacio Vazquez-Abrams

Reputation: 798626

With a lookahead.

>>> re.split(r'\s(?=(?:this|into|ones)\b)', "Let's split this string into many small ones")
["Let's split", 'this string', 'into many small', 'ones']

Upvotes: 11
