James Schinner

Reputation: 1579

Python best way to remove multiple strings from string

Python 3.6

I'd like to remove a list of strings from a string. Here is my first poor attempt:

string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
print(result)

output:

['test']

But this doesn't work if the string isn't cleanly separated by single spaces. I feel there must be a built-in solution. There must be a better way!

I've had a look at this discussion on Stack Overflow, which asks exactly the same question as mine.

So as not to waste my efforts, I timed all the solutions. I believe the easiest, fastest and most Pythonic is the simple for loop, which was not the conclusion in the other post.

result = string
for i in items_to_remove:
    result = result.replace(i,'')
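One caveat with the loop: str.replace removes the text wherever it occurs, including inside longer words, and it leaves the surrounding spaces behind. A minimal sketch of the difference, assuming the items are plain words (the function names remove_substrings and remove_whole_words are illustrative, not taken from the other post):

```python
import re

def remove_substrings(text, items):
    """Remove every occurrence of each item, even inside longer words."""
    for item in items:
        text = text.replace(item, '')
    return text

def remove_whole_words(text, items):
    """Remove items only where they stand alone as whole words."""
    # re.escape guards against items that contain regex metacharacters;
    # \b assumes each item starts and ends with a word character.
    pattern = r'\b(?:' + '|'.join(map(re.escape, items)) + r')\b'
    return re.sub(pattern, '', text)

items = ['this', 'is', 'a', 'string']
print(repr(remove_substrings('this is a test string', items)))  # '   test '
print(repr(remove_substrings('a banana', ['a'])))               # ' bnn'
print(repr(remove_whole_words('a banana', ['a'])))              # ' banana'
```

Both versions leave the original whitespace in place, so a final ' '.join(result.split()) may be wanted if the spacing matters.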

Test Code:

import timeit

t1 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = list(filter(lambda x: x not in items_to_remove, string.split(' ')))
''', number=1000000)
print(t1)

t2 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
def sub(m):
    return '' if m.group() in items_to_remove else m.group()

result = re.sub(r'\w+', sub, string)
''',setup= 'import re', number=1000000)
print(t2)

t3 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = re.sub(r'|'.join(items_to_remove), '', string)
''',setup= 'import re', number=1000000)
print(t3)

t4 = timeit.timeit('''
string = 'this is a test string'
items_to_remove = ['this', 'is', 'a', 'string']
result = string
for i in items_to_remove:
    result = result.replace(i,'')
''', number=1000000)
print(t4)

Output (total seconds for 1,000,000 runs of each snippet):

1.9832003884248448
4.408749988641971
2.124719851741177
1.085117268194475

Upvotes: 7

Views: 11202

Answers (1)

cs95

Reputation: 402473

You can use string.split() with no arguments if you aren't sure how your string is spaced.

string.split() and string.split(' ') work a little differently:

In [128]: 'this     is   a test'.split()
Out[128]: ['this', 'is', 'a', 'test']

In [129]: 'this     is   a test'.split(' ')
Out[129]: ['this', '', '', '', '', 'is', '', '', 'a', 'test']

The former splits on runs of whitespace, so it produces no redundant empty strings.

If you want explicit control over what counts as a separator, there's another solution with regex (note that split() with no arguments already treats tabs and newlines as whitespace):

In [131]: re.split(r'\s+', 'this     is \t  a\ntest')
Out[131]: ['this', 'is', 'a', 'test']

Lastly, I would suggest converting your list of items into a set, so that each membership test in your filter is a hash lookup rather than a linear scan:

In [135]: list(filter(lambda x: x not in {'is', 'this', 'a', 'string'}, string.split()))
Out[135]: ['test']

While on the topic of performance, a list comprehension is a bit faster than a filter, although less concise:

In [136]: [x for x in string.split() if x not in {'is', 'this', 'a', 'string'}]
Out[136]: ['test']
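Putting these pieces together, here is a minimal sketch of a helper that returns a string rather than a list. The name remove_words is hypothetical, and the sketch assumes whole-word, case-sensitive matching with all whitespace collapsed to single spaces in the result:

```python
def remove_words(text, words_to_remove):
    """Drop whole words found in words_to_remove and rejoin with single spaces."""
    # Build the set once so each membership test is a hash lookup.
    blocked = set(words_to_remove)
    # split() with no arguments splits on any run of whitespace.
    return ' '.join(word for word in text.split() if word not in blocked)

print(remove_words('this     is \t a\ntest string', ['this', 'is', 'a', 'string']))  # test
```

Because it compares whole tokens, this leaves words such as 'banana' intact when 'a' is in the removal list, unlike a plain str.replace loop.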

Upvotes: 6
