Reputation: 342
I'm working on a multinomial naive Bayes classifier for articles in pandas and have run into a performance problem. The full code and the dataset I'm using are in my repo: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb
Here's my current setup with two dataframes: df for the articles, with lists of tokenized words, and word_freq to store the precomputed frequency and P(word | category) values.
for category in df['category'].unique():
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]  # The number of categorized articles
    p_cat = cat_articles / df.shape[0]  # P(Cat) = # of articles per category / # of articles
    df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat
Example data:
df
category content
0 QUEER VOICES [online, dating, thoughts, first, date, grew, ...
1 COLLEGE [wishes, class, believe, generation, better, j...
2 RELIGION [six, inspiring, architectural, projects, revi...
3 WELLNESS [ultramarathon, runner, micah, true, died, hea...
4 ENTERTAINMENT [miley, cyrus, ball, debuts, album, art, cyrus...
word_freq
category word freq p_given_cat
46883 MEDIA seat 1.0 0.333333
14187 CRIME ends 1.0 0.333333
81317 WORLD NEWS seat 1.0 0.333333
12463 COMEDY living 1.0 0.200000
20868 EDUCATION director 1.0 0.500000
Please note that the word_freq table is a cross product of categories x words: every word appears exactly once in each category, so the same word is repeated across categories. Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothing).
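For readers who want to reproduce a table of this shape, here is a minimal sketch of a categories x words table with add-one smoothing. The toy articles are made up, and the normalisation (dividing by the smoothed word total of each category) is an assumption, since the post does not show how p_given_cat was actually computed.

```python
import pandas as pd

# Toy data (hypothetical), same layout as df in the question.
articles = pd.DataFrame({
    "category": ["GREEN", "GREEN", "TASTE"],
    "content": [["life", "gut"], ["life"], ["pasta"]],
})

# One row per (article, word), then raw counts per (category, word).
tokens = articles.explode("content").rename(columns={"content": "word"})
counts = pd.crosstab(tokens["category"], tokens["word"])  # 0 where a word is unseen

# Add-one (Laplace) smoothing, reshaped to one row per (category, word).
word_freq = (counts + 1).reset_index().melt(
    id_vars="category", var_name="word", value_name="freq"
)

# Assumption: P(word | category) = smoothed count / smoothed total of the category.
totals = word_freq.groupby("category")["freq"].transform("sum")
word_freq["p_given_cat"] = word_freq["freq"] / totals
```

With two categories and three distinct words this yields six rows, one per (category, word) pair, which is the cross-product layout described above.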
After running the loop above, I do this to find the category with the highest probability (each category's P is stored in a column named after it) and get the following:
df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)
category content \
0 POLITICS [bernie, sanders, campaign, split, whether, fi...
1 COMEDY [bill, maher, compares, police, unions, cathol...
2 WELLNESS [busiest, people, earth, find, time, relax, th...
3 ENTERTAINMENT [lamar, odom, gets, standing, ovation, first, ...
4 GREEN [lead, longer, life, go, gut]
predicted_category
0 ARTS
1 ARTS
2 ARTS
3 TASTE
4 GREEN
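The selection step can be illustrated on a toy frame. This is only a sketch with made-up scores: one column per category holds the score, and idxmax(axis=1) returns the name of the column with the largest value in each row.

```python
import pandas as pd

# Toy scores (hypothetical values): one column per category, as left by the loop.
scores = pd.DataFrame({
    "category": ["GREEN", "TASTE"],
    "content": [["lead", "longer", "life"], ["best", "pasta"]],
    "GREEN": [0.020, 0.001],
    "TASTE": [0.005, 0.030],
})

# Every column except the two original ones is a per-category score.
score_cols = scores.columns.difference(["category", "content"])

# For each row, pick the label of the column holding the largest score.
scores["predicted_category"] = scores[score_cols].idxmax(axis=1)

# Fraction of rows where the prediction matches the true label.
accuracy = (scores["predicted_category"] == scores["category"]).mean()
```

Comparing predicted_category with category this way gives a quick sanity check on how often the classifier is right.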
This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions, and operating on only 1% of it takes almost a minute. I know it's because I am looping through the categories instead of relying on vectorization, but I am very new to pandas, and formulating this succinctly as a groupby escapes me (especially with the two data tables, which might also be unnecessary), so I'm looking for suggestions here.
Thanks!
Upvotes: 1
Views: 79
Reputation: 342
Just in case someone happens to come across this later...
Instead of representing categories x words as a cross product of every possible word with every category, which inflated to over 3 million rows in my data set, I reduced the table to only the necessary rows per category and used a default value for the pairs that are missing. That came to about 600k rows.
But the biggest speedup came from changing to the following:
import numpy as np

for category in df['category'].unique():
    # Calculate P(Category)
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]
    p_cat = cat_articles / df.shape[0]
    # Create a word -> P(word | category) dictionary for quick lookups
    category_dict = category_filter.set_index('word').to_dict()['p_given_cat']
    # For every article, take the product of the P(word | category) values of its words,
    # then multiply by P(category) to get the Bayes score.
    df[category] = df['content'].apply(lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat
I created a dictionary with the word column as the key and P(word | category) as the value. This reduced the problem to a quick dictionary lookup for each element of each list, followed by computing the product.
This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.
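One possible refinement, which is not part of the approach above: a product of many probabilities below 1 can underflow to 0.0 in floating point for long articles, so a common variant sums logarithms instead. The sketch below keeps the same dictionary lookup; the toy data and the DEFAULT_P fallback are hypothetical stand-ins, not values from the real dataset.

```python
import math
import pandas as pd

# Toy stand-ins (hypothetical) for df and word_freq.
df = pd.DataFrame({
    "category": ["GREEN", "TASTE", "GREEN"],
    "content": [["life", "gut"], ["pasta", "life"], ["gut", "gut"]],
})
word_freq = pd.DataFrame({
    "category": ["GREEN", "GREEN", "TASTE", "TASTE"],
    "word": ["life", "gut", "pasta", "life"],
    "p_given_cat": [0.4, 0.5, 0.6, 0.2],
})

DEFAULT_P = 1e-3  # assumed fallback for (word, category) pairs missing from word_freq
log_default = math.log(DEFAULT_P)

# P(category) = share of articles carrying that label.
priors = df["category"].value_counts(normalize=True)

for category, p_cat in priors.items():
    rows = word_freq[word_freq["category"] == category]
    # word -> log P(word | category), built once per category
    log_p = {w: math.log(p) for w, p in zip(rows["word"], rows["p_given_cat"])}
    # log P(category) + sum of log P(word | category) over the article's words
    df[category] = df["content"].apply(
        lambda words: math.log(p_cat) + sum(log_p.get(w, log_default) for w in words)
    )

# The logarithm is monotonic, so the arg-max category is unchanged.
df["predicted_category"] = df[list(priors.index)].idxmax(axis=1)
```

Because only the ranking of the categories matters for the prediction, the log scores can be compared directly without converting back to probabilities.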
Upvotes: 1