diegoaguilar
diegoaguilar

Reputation: 8376

Best performance method for evaluating presence of substrings within strings in python?

I know I can do:

word = 'shrimp'
sentence = 'They eat shrimp in Mexico, and even more these days'
word in sentence

And so it'd would evaluate True.

But what if I have:

words = ['a','la','los','shrimp']

How can I evaluate whether sentence contains any words elements?

I care for performance as list might be large and I don't want to extend code lines using loops

Upvotes: 1

Views: 101

Answers (4)

ajknzhol
ajknzhol

Reputation: 6450

>>> words = ['a','la','los','shrimp']
>>> sentence = 'They eat shrimp in Mexico, and more these days'
>>> [word for word in words if word in sentence]
['a', 'shrimp']
>>> any(word in sentence for word in words)
True

words = ['a','la','los','shrimp']
sentence = 'They eat shrimp in Mexico, and more these days'
a = map(lambda word: word in words if word in sentence else None, words)
print(a.__next__())
True

There are many ways to perform this.

Since you asked for Performance, here are the results.

In [1]: words = ['a','la','los','shrimp']

In [2]: sentence = 'They eat shrimp in Mexico

In [3]: a = map(lambda word: word in words if word in sentence else None, words)

In [4]: a.__next__()
Out[4]: True

In [5]: %timeit a
10000000 loops, best of 3: 65.3 ns per loop

In [6]: b = any(word in sentence for word in words)

In [7]: %timeit b
10000000 loops, best of 3: 50.1 ns per loop

words = ['a','la','fresh','shrimp','ban']
sentence = 'They fresh bananas in Mexico, and more these days'
a = [word for word in words if word in sentence.split(' ')]
print(a) #['fresh']

b = any(word in sentence.split(' ') for word in words)
print(b) #True

Upvotes: 3

James Sapam
James Sapam

Reputation: 16940

Here is what i Found.

Testing is done on 100k words:

NB: Generating sets is expensive inside the function if your list is long, BUT if you can pre-process into set before calling the function then, set will win.

Snippet:

#!/usr/bin/python

import cProfile
from timeit import Timer
from faker import Faker


def func1(sentence, words):
    return any(word in sentence for word in words)

def func2(sentence, words):
    for word in words:
        if word in sentence:
            return True
    return False

def func3(sentence, words):
    # using set
    sets = set(sentence)
    return bool(set(words).intersection(sets))

def func4(sentence, words):
    # using set
    sets = set(sentence)
    for word in words:
        if word in sets:
            return True
    return False

def func5(sentence, words):
    # using set
    sentence, words = set(sentence), set(words)
    return not words.isdisjoint(sentence)

s = Faker()
sentence = s.sentence(nb_words=100000).split()
words = 'quidem necessitatibus minus id quos in neque omnis molestias'.split()

func = [ func1, func2, func3, func4, func5 ]

for fun in func:
    t = Timer(lambda: fun(sentence, words))
    print fun.__name__, cProfile.run('t.timeit(number=1000)')

Output:

Result: func2 wins

func1          5011 function calls in 0.032 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.032    0.032 <string>:1(<module>)
     1000    0.000    0.000    0.032    0.000 exists.py:42(<lambda>)
     1000    0.001    0.000    0.032    0.000 exists.py:8(func1)
     2000    0.030    0.000    0.030    0.000 exists.py:9(<genexpr>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    0.032    0.032 timeit.py:178(timeit)
        1    0.000    0.000    0.032    0.032 timeit.py:96(inner)
     1000    0.000    0.000    0.030    0.000 {any}
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
func2          2011 function calls in 0.031 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.031    0.031 <string>:1(<module>)
     1000    0.031    0.000    0.031    0.000 exists.py:11(func2)
     1000    0.000    0.000    0.031    0.000 exists.py:42(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    0.031    0.031 timeit.py:178(timeit)
        1    0.000    0.000    0.031    0.031 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
func3          3011 function calls in 7.079 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.079    7.079 <string>:1(<module>)
     1000    7.069    0.007    7.073    0.007 exists.py:17(func3)
     1000    0.004    0.000    7.077    0.007 exists.py:42(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    7.079    7.079 timeit.py:178(timeit)
        1    0.002    0.002    7.079    7.079 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1000    0.004    0.000    0.004    0.000 {method 'intersection' of 'set' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
func4          2011 function calls in 7.022 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.022    7.022 <string>:1(<module>)
     1000    7.014    0.007    7.014    0.007 exists.py:22(func4)
     1000    0.006    0.000    7.020    0.007 exists.py:42(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    7.022    7.022 timeit.py:178(timeit)
        1    0.002    0.002    7.022    7.022 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None
func5          3011 function calls in 7.142 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    7.142    7.142 <string>:1(<module>)
     1000    7.133    0.007    7.134    0.007 exists.py:30(func5)
     1000    0.006    0.000    7.140    0.007 exists.py:42(<lambda>)
        1    0.000    0.000    0.000    0.000 timeit.py:143(setup)
        1    0.000    0.000    7.142    7.142 timeit.py:178(timeit)
        1    0.002    0.002    7.142    7.142 timeit.py:96(inner)
        1    0.000    0.000    0.000    0.000 {gc.disable}
        1    0.000    0.000    0.000    0.000 {gc.enable}
        1    0.000    0.000    0.000    0.000 {gc.isenabled}
        1    0.000    0.000    0.000    0.000 {globals}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     1000    0.002    0.000    0.002    0.000 {method 'isdisjoint' of 'set' objects}
        2    0.000    0.000    0.000    0.000 {time.time}


None

Upvotes: 1

Ramchandra Apte
Ramchandra Apte

Reputation: 4079

Could be read as English. But might not be the most performant. First benchmark and then only if needed, use an optimized solution.

any(word in sentence for word in words)

any returns True if any element is True, otherwise False. (word in sentence for word in words) is an iterable, which produces whether each word is in the sentence or not.

Upvotes: 7

Josh Durham
Josh Durham

Reputation: 1662

for word in words:
    if word in sentence:
        #do stuff

Upvotes: 1

Related Questions