Pandas: Article category prediction performance slow

Question

I'm working on a multinomial naive Bayes classifier for articles in pandas and have run into a performance problem. The full code and the dataset I'm using are in my repo: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb

Here's my current setup with two DataFrames: df holds the articles as lists of tokenized words, and word_freq stores the precomputed frequency and P(word | category) values.

for category in df['category'].unique():
    # Rows of word_freq that belong to this category only
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]  # number of articles labelled with this category
    p_cat = cat_articles / df.shape[0]  # P(Cat) = # of articles in category / # of articles
    # Per article: product of P(word | Cat) over the distinct words it contains, times the prior P(Cat)
    df[category] = df['content'].apply(lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat

Example data:

df

        category                                            content
0   QUEER VOICES  [online, dating, thoughts, first, date, grew, ...
1        COLLEGE  [wishes, class, believe, generation, better, j...
2       RELIGION  [six, inspiring, architectural, projects, revi...
3       WELLNESS  [ultramarathon, runner, micah, true, died, hea...
4  ENTERTAINMENT  [miley, cyrus, ball, debuts, album, art, cyrus...

word_freq

           category         word  freq  p_given_cat
46883         MEDIA         seat   1.0     0.333333
14187         CRIME         ends   1.0     0.333333
81317    WORLD NEWS         seat   1.0     0.333333
12463        COMEDY       living   1.0     0.200000
20868     EDUCATION     director   1.0     0.500000

Please note that the word_freq table is the cross product of categories x words: every word appears exactly once in each category, so the same word is repeated across categories. Also, the freq column has been incremented by 1 to avoid zero values (Laplace smoothing).
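To make the shape of that table concrete, here is a minimal sketch of one way such a cross-product table could be built on toy data. It is not the code from the notebook: the toy_df and toy_word_freq names and their values are made up for illustration, and the p_given_cat formula (smoothed count divided by the total smoothed count in the category) is an assumption about how the probabilities are estimated.

```python
import pandas as pd

# Toy stand-in for df (hypothetical values, not the real dataset).
toy_df = pd.DataFrame({
    "category": ["COMEDY", "COMEDY", "CRIME"],
    "content": [["seat", "living"], ["living"], ["ends", "seat"]],
})

# One row per (category, word) occurrence.
tokens = toy_df.explode("content").rename(columns={"content": "word"})

# Raw counts per (category, word) pair that actually occurs.
counts = tokens.groupby(["category", "word"]).size()

# Full categories x words grid, so unseen pairs are present with count 0.
full_index = pd.MultiIndex.from_product(
    [tokens["category"].unique(), tokens["word"].unique()],
    names=["category", "word"],
)
toy_word_freq = (
    counts.reindex(full_index, fill_value=0)
    .add(1)  # Laplace smoothing: every count is incremented by 1
    .astype(float)
    .rename("freq")
    .reset_index()
)

# Assumed estimate: smoothed count / total smoothed count in the category.
toy_word_freq["p_given_cat"] = (
    toy_word_freq["freq"]
    / toy_word_freq.groupby("category")["freq"].transform("sum")
)
```

With this layout each category contributes one row per vocabulary word, and the p_given_cat values within a category sum to 1.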

After running the above, I find the category with the highest P for each article (each category's P is stored in a column named after the category) and get the following:

df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop = True)

          category                                            content  predicted_category
0         POLITICS  [bernie, sanders, campaign, split, whether, fi...                ARTS
1           COMEDY  [bill, maher, compares, police, unions, cathol...                ARTS
2         WELLNESS  [busiest, people, earth, find, time, relax, th...                ARTS
3    ENTERTAINMENT  [lamar, odom, gets, standing, ovation, first, ...               TASTE
4            GREEN                      [lead, longer, life, go, gut]               GREEN

This method seems to work well, but it is unfortunately very slow. The full dataset has 200,000 articles with short descriptions, and running this on only 1% of it takes almost a minute. I believe it's because I am looping through the categories instead of relying on vectorization, but I am very new to pandas, and I can't work out how to express this succinctly as a groupby (especially with two tables involved; a groupby might also be unnecessary), so I'm looking for suggestions.
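For reference, a minimal sketch of what a loop-free formulation might look like, on toy data. The toy_df and toy_word_freq frames and their values are made up for illustration. The sketch swaps the product of probabilities for a sum of logs, which gives the same argmax (log is monotonic) and avoids underflow when many small probabilities are multiplied. It assumes the intermediate tokens x categories table fits in memory; on the full dataset it may need to be processed in chunks.

```python
import numpy as np
import pandas as pd

# Toy stand-ins shaped like df and word_freq above (values are made up).
toy_df = pd.DataFrame({
    "category": ["COMEDY", "CRIME"],
    "content": [["living", "seat", "living"], ["ends"]],
})
toy_word_freq = pd.DataFrame({
    "category": ["COMEDY"] * 3 + ["CRIME"] * 3,
    "word": ["living", "seat", "ends"] * 2,
    "p_given_cat": [0.5, 0.3, 0.2, 0.2, 0.3, 0.5],
})

# words x categories table of log P(word | category).
log_p = np.log(
    toy_word_freq.pivot(index="word", columns="category", values="p_given_cat")
)

# log P(category), estimated from label counts as in the loop.
log_prior = np.log(toy_df["category"].value_counts(normalize=True))

# One row per distinct (article, word) pair. isin() in the loop also
# ignores repeated words, so duplicates are dropped here too.
tokens = (
    toy_df["content"].explode().rename("word").rename_axis("doc")
    .reset_index().drop_duplicates()
)
# Words missing from word_freq are skipped, as isin() does in the loop.
tokens = tokens[tokens["word"].isin(log_p.index)]

# Look up each token's row of log-probabilities, sum them per article,
# then add the log prior of each category column.
scores = (
    log_p.loc[tokens["word"]]
    .set_index(tokens["doc"].to_numpy())
    .groupby(level=0).sum()
    .reindex(toy_df.index, fill_value=0.0)  # articles with no known words
    .add(log_prior, axis=1)
)
toy_df["predicted_category"] = scores.idxmax(axis=1)
```

The design choice here is to pivot word_freq once into a lookup table, so the per-category filtering and the per-article apply both disappear; the remaining work is a single label lookup and a groupby sum.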

Thanks!
