Reputation: 986
I would like to count the number of times a specific topic is brought up in a very long list of words. Currently, I have a dictionary of dictionaries where the outer keys are the topics and the inner keys are the keywords of that topic.
I am trying to efficiently count the keyword occurrences and maintain a cumulative sum of their corresponding topic occurrences.
Ultimately, I want to save the output for multiple texts. This is an example of what I currently have implemented. The issues I have with it are that it is extremely slow and that it does not store the keyword counts in the output DataFrame. Is there an alternative that resolves these issues?
import pandas as pd

topics = {
    "mathematics": {
        "analysis": 0,
        "algebra": 0,
        "logic": 0
    },
    "philosophy": {
        "ethics": 0,
        "metaphysics": 0,
        "epistemology": 0
    }
}

texts = {
    "text_a": [
        "the", "major", "areas", "of", "study", "in", "mathematics", "are",
        "analysis", "algebra", "and", "logic", "in", "philosophy", "they",
        "are", "ethics", "metaphysics", "and", "epistemology"
    ],
    "text_b": [
        "logic", "is", "studied", "both", "in", "mathematics", "and",
        "philosophy"
    ]
}

topics_by_text = pd.DataFrame()
for title, text in texts.items():
    topic_count = {}
    for topic, sub_dict in topics.items():
        curr_topic_counter = 0
        for keyword, count in sub_dict.items():
            keyword_occurrences = text.count(keyword)
            topics[topic][keyword] = keyword_occurrences
            curr_topic_counter += keyword_occurrences
        topic_count[topic] = curr_topic_counter
    topics_by_text[title] = pd.Series(topic_count)

print(topics_by_text)
Upvotes: 0
Views: 72
Reputation: 155
I'm not sure about the speed, but the following code stores the keyword counts alongside the topic totals in a neat MultiIndexed layout.
# Returns a dictionary of per-keyword counts, plus their total under 'count'
def CountFrequency(my_list, keyword):
    # Tally every word in the text in a single pass
    freq = {}
    for item in my_list:
        if item in freq:
            freq[item] += 1
        else:
            freq[item] = 1
    # Keep only this topic's keywords; a keyword that never appears counts as 0
    dict_ = {}
    for your_key in keyword:
        dict_[your_key] = freq.get(your_key, 0)
    dict_['count'] = sum(dict_.values())
    return dict_

# Calculates count
output = {}
for key, value in texts.items():
    for topic, keywords in topics.items():
        # Create the per-topic dictionary on first use
        output.setdefault(topic, {})[key] = CountFrequency(value, keywords)

# To DataFrame
dict_of_df = {k: pd.DataFrame(v) for k, v in output.items()}
df = pd.concat(dict_of_df, axis=0)
df.T  # rows are texts, columns are (topic, keyword) pairs
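If speed is the main concern, one option is to tally each text once with `collections.Counter` and then look up the keywords, instead of calling `text.count(keyword)` (a full scan of the list) for every keyword. The following is a minimal sketch, assuming the same `topics` and `texts` structures as in the question; the function name `count_topics` and the column level names `topic` and `keyword` are my own choices, not anything required by pandas. I have not benchmarked it on long texts.

```python
from collections import Counter

import pandas as pd

# Same structures as in the question
topics = {
    "mathematics": {"analysis": 0, "algebra": 0, "logic": 0},
    "philosophy": {"ethics": 0, "metaphysics": 0, "epistemology": 0},
}
texts = {
    "text_a": [
        "the", "major", "areas", "of", "study", "in", "mathematics", "are",
        "analysis", "algebra", "and", "logic", "in", "philosophy", "they",
        "are", "ethics", "metaphysics", "and", "epistemology",
    ],
    "text_b": [
        "logic", "is", "studied", "both", "in", "mathematics", "and",
        "philosophy",
    ],
}


def count_topics(texts, topics):
    """Hypothetical helper: one row per text, one column per (topic, keyword)."""
    # Column labels: every keyword of a topic, followed by that topic's total
    columns = [
        (topic, label)
        for topic, keywords in topics.items()
        for label in list(keywords) + ["count"]
    ]
    rows = []
    for words in texts.values():
        word_counts = Counter(words)  # single pass over the text
        row = []
        for keywords in topics.values():
            # A Counter returns 0 for words it has never seen
            counts = [word_counts[keyword] for keyword in keywords]
            row.extend(counts + [sum(counts)])
        rows.append(row)
    return pd.DataFrame(
        rows,
        index=list(texts),
        columns=pd.MultiIndex.from_tuples(columns, names=["topic", "keyword"]),
    )


result = count_topics(texts, topics)
print(result)
```

Each text is scanned once and each keyword lookup is a dictionary access, so the work grows with the text length plus the number of keywords rather than their product. A single topic's block can be selected with `result["mathematics"]`.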
Upvotes: 1