Filter Pandas series based on .sum() totals

Question

I have data that contains a row per user, then many columns populated with 1 or 0 based on their interaction with a particular product category.

I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read, I used .sum() on my dataframe to see the categories that are interacted with most, but how can I now run correlation on just this set?

Here is the a sample of the outpul from my .sum():

shoes_and_flats                                                                                           37
nightwear_and_slippers                                                                                    61
shorts_and_shorts                                                                                         23
accessories_and_fragrance                                                                                 25
jackets_and_coats_and_wool                                                                                12
dresses_and_skirts_and_sleeveless_dresses                                                                 35
swimwear_and_bikinis                                                                                      49
dresses_and_skirts_and_floral_dresses                                                                      7
jackets_and_coats_and_harrington_jackets                                                                  18
dresses_and_skirts_and_tunic_dresses                                                                       8
sports_performance_tops_and_vests                                                                          4
jeans_and_bootcut_jeans                                                                                    2
nightwear_and_nightwear                                                                                    1

Created by doing...

totals = df.sum()

I decided that I'd like to remove categories with less than 50 interactions, so I used... totals = totals[1: -1].sort_values() > 50

But that returns all categories regardless of their True or False value.

My end goal is to use .corr() on the data, how can I run this and only return a grid where the categories have more than 50 interactions?

Dillon · Accepted Answer

You want to filter the columns in the dataframe. You're on the right track with the True and False results, you just have to use this as a filter

Assuming the data is in a dataframe called df, this will return only the columns you want:

totals = df.sum()
df[totals[totals > 50].index]

Filter Pandas series based on .sum() totals

Answers (2)

Related Questions