Reputation: 3885
I want to count occurrences of list elements in a text with Python. I know that I can use .count(), but I have read that it can hurt performance. Also, an element in the list can contain more than one word.
my_list = ["largest", "biggest", "greatest", "the best"]
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
I can do this:
num = 0
for i in my_list:
    num += my_text.lower().count(i.lower())
print(num)
This works, but what if my list has 500 elements and my string is 3000 words? In that case the performance is very poor.
Is there a way to do this with good / fast performance?
Upvotes: 1
Views: 1030
Reputation: 88226
Since my_list contains strings with more than one word, you'll have to find the n-grams of my_text to find matches, since splitting on spaces won't do. Also note that your approach is not advisable: for every single string in my_list, you traverse the whole string my_text by using count. A better way is to predefine the n-grams that you'll be looking for beforehand.
Here's one approach using nltk's ngrams. I've added another string to my_list to better illustrate the process:
from nltk import ngrams
from collections import Counter, defaultdict
my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"
The first step is to define a dictionary containing the different lengths of the n-grams that we'll be looking up:
d = defaultdict(list)
for i in my_list:
    k = i.split()
    d[len(k)].append(tuple(k))
print(d)
defaultdict(list,
            {1: [('largest',), ('biggest',), ('greatest',)],
             2: [('the', 'best')],
             5: [('My', 'friend', 'is', 'the', 'best')]})
Then split my_text into a list of words and, for each key in d, build a Counter over the n-grams of that length. Finally, for each phrase tuple stored under that key in d, take its count from the Counter:
my_text_split = my_text.replace('.', '').split()
match_counts = dict()
for n, v in d.items():
    c = Counter(ngrams(my_text_split, n))
    for k in v:
        if k in c:
            match_counts[k] = c[k]
Which will give:
print(match_counts)
{('largest',): 2,
('biggest',): 2,
('greatest',): 1,
('the', 'best'): 1,
('My', 'friend', 'is', 'the', 'best'): 1}
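If you'd rather avoid the nltk dependency, the same n-grams can be built with plain zip over shifted slices of the token list. A minimal self-contained sketch of the whole approach (the helper name ngrams here is my own stand-in for nltk's function), which also sums the counts to reproduce the single num from the question:

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    # n-grams via zip of the token list against its shifted copies
    return zip(*(tokens[i:] for i in range(n)))

my_list = ["largest", "biggest", "greatest", "the best", 'My friend is the best']
my_text = "i have the biggest house and the biggest car. My friend is the best. Best way win this is to make the largest house and largest treehouse and then you will be the greatest"

# Group the search phrases by their word count, stored as tuples
d = defaultdict(list)
for phrase in my_list:
    words = phrase.split()
    d[len(words)].append(tuple(words))

tokens = my_text.replace('.', '').split()
match_counts = {}
for n, phrases in d.items():
    c = Counter(ngrams(tokens, n))  # one pass over the text per phrase length
    for p in phrases:
        if p in c:
            match_counts[p] = c[p]

print(match_counts)
print(sum(match_counts.values()))  # total matches, like num in the question
```

Note that, like the nltk version, this is case-sensitive; lowercase both the phrases and the text first if you want the case-insensitive behaviour of the original .count() loop.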
Upvotes: 2