Reputation: 3379
I have data that contains a row per user, then many columns populated with 1
or 0
based on their interaction with a particular product category.
I am running some correlation analysis, and I'd like to remove the less significant categories to make my analysis easier to read, I used .sum()
on my dataframe to see the categories that are interacted with most, but how can I now run correlation on just this set?
Here is the a sample of the outpul from my .sum()
:
shoes_and_flats 37
nightwear_and_slippers 61
shorts_and_shorts 23
accessories_and_fragrance 25
jackets_and_coats_and_wool 12
dresses_and_skirts_and_sleeveless_dresses 35
swimwear_and_bikinis 49
dresses_and_skirts_and_floral_dresses 7
jackets_and_coats_and_harrington_jackets 18
dresses_and_skirts_and_tunic_dresses 8
sports_performance_tops_and_vests 4
jeans_and_bootcut_jeans 2
nightwear_and_nightwear 1
Created by doing...
totals = df.sum()
I decided that I'd like to remove categories with less than 50 interactions, so I used...
totals = totals[1: -1].sort_values() > 50
But that returns all categories regardless of their True
or False
value.
My end goal is to use .corr()
on the data, how can I run this and only return a grid where the categories have more than 50 interactions?
Upvotes: 1
Views: 1015
Reputation: 999
You want to filter the columns in the dataframe. You're on the right track with the True
and False
results, you just have to use this as a filter
Assuming the data is in a dataframe called df
, this will return only the columns you want:
totals = df.sum()
df[totals[totals > 50].index]
Upvotes: 2
Reputation: 36
I believe you could use:
totals = totals[totals > 50]
Edit: The syntax of the accepted answer above was not working for me so just in case this happens to someone else here is what I found worked
totals = df.sum()
totals = totals[totals > 50]
df_more_than_50 = df.filter(totals.index))
Upvotes: 1