Reputation: 845
For a current project, I am planning to count the occurrences of a number of specific words within a data set.
For the line count = word.count(wordlist), however, I am receiving the following error: TypeError: must be str, not list. Is there a smart way to have Python accept a word list, so that it checks for several words rather than just one?
The corresponding code looks like this:
# Word frequency analysis
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Analysis loops running through different string sections
for i in ['Text_Pro', 'Text_Con', 'Text_Main']:
    common_words = get_top_n_bigram(df[i], 500)
    for word, freq in common_words:
        print(word, freq)

# Analysis loops checking if specific words are found
for word in common_words:
    wordlist = ['good', 'management', 'bad']
    count = word.count(wordlist)
    print(count)
Upvotes: 2
Views: 1654
Reputation: 61526
Since we need to examine every word regardless, we may as well build the entire histogram of word frequency, and then extract the word counts that we're interested in:
from collections import Counter

def words_matching(sentence, candidates):
    histogram = Counter(sentence.split())
    return sum(histogram[word] for word in candidates)
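As a quick check, here is a minimal usage sketch. The sample sentence and candidate list are made up for illustration, and matching is exact on whitespace-separated tokens, so case and punctuation matter.

```python
from collections import Counter

def words_matching(sentence, candidates):
    # Build the full word histogram once, then look up only the candidates
    histogram = Counter(sentence.split())
    return sum(histogram[word] for word in candidates)

# Hypothetical sample input, not taken from the question's data set
total = words_matching("good management beats bad management",
                       ["good", "management", "bad"])
print(total)  # 4
```

A Counter returns 0 for missing keys, so candidates that never appear in the sentence simply contribute nothing.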
Upvotes: 1
Reputation: 3961
Use a list comprehension:
count = 0
for word in common_words:
    wordlist = ['good', 'management', 'bad']
    count += sum([word.count(i) for i in wordlist])
print(count)
As a dictionary, per request from the comment section to this answer:
count = {}
for word in common_words:
    wordlist = ["good", "management", "bad"]
    count[word] = sum([word.count(i) for i in wordlist])
print(count)
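Two caveats, offered as a sketch rather than a definitive fix. First, str.count matches substrings, so 'bad' would also be counted inside 'badge'. Second, get_top_n_bigram in the question returns (bigram, frequency) tuples, so each item has to be unpacked before counting. Assuming whole-word matches are wanted, the bigram can be split into tokens first. The sample data below is hypothetical.

```python
# Hypothetical sample shaped like the output of get_top_n_bigram:
# a list of (bigram, frequency) pairs
common_words = [("good management", 12), ("bad service", 7), ("badge holder", 3)]
wordlist = ["good", "management", "bad"]

count = {}
for bigram, freq in common_words:
    tokens = bigram.split()  # compare whole words, not substrings
    count[bigram] = sum(tokens.count(w) for w in wordlist)

print(count)  # {'good management': 2, 'bad service': 1, 'badge holder': 0}
```

With plain str.count, 'badge holder' would score 1 because 'bad' is a substring of 'badge'.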
Upvotes: 1
Reputation: 45750
I'd go the more efficient, but more verbose way of checking against a set in an old-fashioned loop:
from typing import Iterable

def count_many(string: str, words: Iterable[str]) -> int:
    search_set = set(words)  # To ease lookups
    split = string.split()  # Cut into words
    count = 0
    for word in split:
        if word in search_set:
            count += 1
    return count
>>> count_many("hello world hello no world hello", ["hello", "world"])
5
Put the words to look up in a set for faster lookups, split the source text into words, then just loop and count.
Regardless of the length of words, this should need only two iterations over the source text: one to split it and one to count.
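If a per-word breakdown is wanted instead of a single total, the same set-lookup idea can feed a Counter. This is only a sketch: count_each is a hypothetical name, not part of the code above.

```python
from collections import Counter
from typing import Iterable

def count_each(string: str, words: Iterable[str]) -> Counter:
    # Hypothetical helper: tally only the tokens found in the search set
    search_set = set(words)
    return Counter(word for word in string.split() if word in search_set)

counts = count_each("hello world hello no world hello", ["hello", "world"])
print(counts["hello"], counts["world"], sum(counts.values()))  # 3 2 5
```

The total from sum(counts.values()) matches what count_many returns for the same input.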
Upvotes: 1