taga

Reputation: 3885

Count occurrences of list of strings in text

I want to count occurrences of list elements in a text with Python. I know that I can use .count(), but I have read that this can affect performance. Also, an element of the list can consist of more than one word.

my_list = ["largest", "biggest", "greatest", "the best"]

my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"

I can do this:

num = 0
for i in my_list:
    num += my_text.lower().count(i.lower())

print(num)

This works, but what if my list has 500 elements and my text has 3,000 words? In that case performance is very poor.

Is there a way to do this with good / fast performance?

Upvotes: 1

Views: 1030

Answers (1)

yatu

Reputation: 88226

Since my_list contains strings with more than one word, you'll have to look for n-grams of my_text to find matches; splitting on spaces alone won't do. Also note that your approach is not advisable: for every single string in my_list, you traverse the whole my_text by calling count on it. A better way is to predefine the n-grams you'll be looking for.

Here's one approach using nltk's ngrams. I've added another string to my_list to better illustrate the process:

from nltk import ngrams
from collections import Counter, defaultdict

my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"

The first step is to build a dictionary grouping the phrases to look up by their length in words, i.e. by the n of the n-gram:

d = defaultdict(list)
for i in my_list:
    # group each phrase by its word count, stored as a tuple of words
    k = i.split()
    d[len(k)].append(tuple(k))

print(d)
defaultdict(list,
            {1: [('largest',), ('biggest',), ('greatest',)],
             2: [('the', 'best')],
             5: [('My', 'friend', 'is', 'the', 'best')]})
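
To see what ngrams itself produces: it simply yields tuples of n consecutive tokens from the sequence you pass in. A quick illustration using a few words from my_text:

print(list(ngrams(['the', 'biggest', 'house'], 2)))
[('the', 'biggest'), ('biggest', 'house')]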

Then split my_text into a list of words, and for each key n in d build a Counter over the corresponding n-grams of the text. Then, for each tuple stored under that key in d, record its count from the Counter:

my_text_split = my_text.replace('.', '').split()
match_counts = dict()
for n, v in d.items():
    # count every n-gram of this length once, then look up each phrase
    c = Counter(ngrams(my_text_split, n))
    for k in v:
        if k in c:
            match_counts[k] = c[k]

Which will give:

print(match_counts)

{('largest',): 2,
 ('biggest',): 2,
 ('greatest',): 1,
 ('the', 'best'): 1,
 ('My', 'friend', 'is', 'the', 'best'): 1}
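
And if, as in the question's original snippet, you only need the total number of matches, you can simply sum the values:

print(sum(match_counts.values()))
7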

Upvotes: 2
