KingCodeFish

Reputation: 342

Pandas: Article category prediction performance slow

I'm working on a multinomial naive Bayes classifier for articles in Pandas and have run into a performance issue. My repo is here if you want the full code and the dataset I'm using: https://github.com/kingcodefish/multinomial-bayesian-classification/blob/master/main.ipynb

Here's my current setup with two dataframes: df holds the articles as lists of tokenized words, and word_freq stores the precomputed frequencies and P(word | category) values.

for category in df['category'].unique():
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0] # The number of categorized articles
    p_cat = cat_articles / df.shape[0] # P(Cat) = # of articles per category / # of articles
    # Product of P(word | Cat) over the article's words, times P(Cat)
    df[category] = df['content'].apply(
        lambda x: category_filter[category_filter['word'].isin(x)]['p_given_cat'].prod()) * p_cat

Example data:

df

        category                                            content
0   QUEER VOICES  [online, dating, thoughts, first, date, grew, ...
1        COLLEGE  [wishes, class, believe, generation, better, j...
2       RELIGION  [six, inspiring, architectural, projects, revi...
3       WELLNESS  [ultramarathon, runner, micah, true, died, hea...
4  ENTERTAINMENT  [miley, cyrus, ball, debuts, album, art, cyrus...

word_freq

           category         word  freq  p_given_cat
46883         MEDIA         seat   1.0     0.333333
14187         CRIME         ends   1.0     0.333333
81317    WORLD NEWS         seat   1.0     0.333333
12463        COMEDY       living   1.0     0.200000
20868     EDUCATION     director   1.0     0.500000

Please note that the word_freq table is the cross product of categories x words: every word appears exactly once per category, so the same word shows up in multiple rows (one per category). Also, the freq column has been increased by 1 to avoid zero values (Laplace smoothing).
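
For reference, something like the following would build a table of this shape (a sketch assuming df has the category and content columns shown above; not necessarily the exact code in the repo):

import pandas as pd

# Sketch: one row per (category, word) pair across the full cross product.
# Assumes df has a 'category' column and a list-valued 'content' column as shown above.
exploded = df.explode('content').rename(columns={'content': 'word'})
counts = exploded.groupby(['category', 'word']).size()

# Cross product of all categories x all words, missing pairs filled with 0,
# then +1 so no frequency is zero (Laplace smoothing)
full_index = pd.MultiIndex.from_product(
    [df['category'].unique(), exploded['word'].unique()],
    names=['category', 'word'])
word_freq = (counts.reindex(full_index, fill_value=0)
                   .add(1)
                   .rename('freq')
                   .reset_index())

# P(word | category) = freq / total freq within that category
word_freq['p_given_cat'] = (word_freq['freq'] /
                            word_freq.groupby('category')['freq'].transform('sum'))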

After running the above, I do this to find the category with the maximum P (each category's P is stored in a column named after the category) and get the following:

df['predicted_category'] = df[df.columns.difference(['category', 'content'])].idxmax(axis=1)
df = df.drop(df.columns.difference(['category', 'content', 'predicted_category']), axis=1).reset_index(drop=True)
          category                                            content  \
0         POLITICS  [bernie, sanders, campaign, split, whether, fi...   
1           COMEDY  [bill, maher, compares, police, unions, cathol...   
2         WELLNESS  [busiest, people, earth, find, time, relax, th...   
3    ENTERTAINMENT  [lamar, odom, gets, standing, ovation, first, ...   
4            GREEN                      [lead, longer, life, go, gut]   
   predicted_category  
0                ARTS  
1                ARTS  
2                ARTS  
3               TASTE  
4               GREEN  

This method seems to work well, but it is unfortunately really slow. I am using a large dataset of 200,000 articles with short descriptions, and operating on only 1% of it takes almost a minute. I know it's because I am looping over the categories instead of relying on vectorization, but I am very new to Pandas and can't figure out how to express this succinctly with a groupby (especially across the two tables, which may not even be necessary), so I'm looking for suggestions here.

Thanks!

Upvotes: 1

Views: 79

Answers (1)

KingCodeFish

Reputation: 342

Just in case someone happens to come across this later...

Instead of representing my categories x words as a cross product of every possible word in every category, which inflated to over 3 million rows in my dataset, I reduced the table to only the words that actually occur in each category (about 600k rows) and provided a default value for words that a category has never seen.
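
Roughly, the reduced table can be built with a plain groupby over only the words that actually occur (again a sketch with the same assumed column names as above; the default for unseen words is applied at lookup time in the loop below):

import pandas as pd

# Sketch: only (category, word) pairs that actually occur, Laplace-smoothed.
# Words a category has never seen fall back to a default inside the loop below.
exploded = df.explode('content').rename(columns={'content': 'word'})
word_freq = (exploded.groupby(['category', 'word']).size()
                     .add(1)
                     .rename('freq')
                     .reset_index())
word_freq['p_given_cat'] = (word_freq['freq'] /
                            word_freq.groupby('category')['freq'].transform('sum'))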

But the biggest speedup came from changing to the following:

import numpy as np

for category in df['category'].unique():
    # Calculate P(Category)
    category_filter = word_freq.loc[word_freq['category'] == category]
    cat_articles = df.loc[df['category'] == category].shape[0]
    p_cat = cat_articles / df.shape[0]
    
    # Create a word->P(word | category) dictionary for quick lookups
    category_dict = category_filter.set_index('word').to_dict()['p_given_cat']
    
    # For every article, take the product of the P(word | category) values of its words,
    # then multiply by P(category) to get the Bayes numerator.
    df[category] = df['content'].apply(
        lambda x: np.prod([category_dict.get(y, 0.001 / (cat_articles + 0.001)) for y in x])) * p_cat

I created a dictionary from the two columns, with word as the key and P(word | category) as the value. This reduced the problem to a quick dictionary lookup for each word of each article and computing the product.

This ended up being about 100x faster, parsing the whole dataset in ~40 seconds.

Upvotes: 1
