Ben P
Ben P

Reputation: 3379

Filter Pandas series based on .sum() totals

I have data that contains a row per user, then many columns populated with 1 or 0 based on their interaction with a particular product category.

I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read, I used .sum() on my dataframe to see the categories that are interacted with most, but how can I now run correlation on just this set?

Here is the a sample of the outpul from my .sum():

shoes_and_flats                                                                                           37
nightwear_and_slippers                                                                                    61
shorts_and_shorts                                                                                         23
accessories_and_fragrance                                                                                 25
jackets_and_coats_and_wool                                                                                12
dresses_and_skirts_and_sleeveless_dresses                                                                 35
swimwear_and_bikinis                                                                                      49
dresses_and_skirts_and_floral_dresses                                                                      7
jackets_and_coats_and_harrington_jackets                                                                  18
dresses_and_skirts_and_tunic_dresses                                                                       8
sports_performance_tops_and_vests                                                                          4
jeans_and_bootcut_jeans                                                                                    2
nightwear_and_nightwear                                                                                    1

Created by doing...

totals = df.sum()

I decided that I'd like to remove categories with less than 50 interactions, so I used... totals = totals[1: -1].sort_values() > 50

But that returns all categories regardless of their True or False value.

My end goal is to use .corr() on the data, how can I run this and only return a grid where the categories have more than 50 interactions?

Upvotes: 1

Views: 1015

Answers (2)

Dillon
Dillon

Reputation: 999

You want to filter the columns in the dataframe. You're on the right track with the True and False results, you just have to use this as a filter

Assuming the data is in a dataframe called df, this will return only the columns you want:

totals = df.sum()
df[totals[totals > 50].index]

Upvotes: 2

Ana Cruz
Ana Cruz

Reputation: 36

I believe you could use:

totals = totals[totals > 50]

Edit: The syntax of the accepted answer above was not working for me so just in case this happens to someone else here is what I found worked

totals = df.sum()
totals = totals[totals > 50]
df_more_than_50 = df.filter(totals.index))

Upvotes: 1

Related Questions