Reputation: 310
I have a dataframe in which all columns are categorical variables. I want to drop any column in which a single category's share of the rows exceeds a given threshold (cutoff).
Var1 | Var2 | Var3 |
---|---|---|
Male | a | Lite |
Male | b | Full |
Male | c | Full |
Male | c | Lite |
Here, every value in Var1 is Male (a 100% proportion). If I pass 51% as the cutoff, the function should return the following:
Var2 | Var3 |
---|---|
a | Lite |
b | Full |
c | Full |
c | Lite |
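For reference, the sample frame above can be rebuilt as follows, together with the largest category share per column. This is only a sketch: the variable names `df` and `top_share` are assumptions and do not come from the original post.

```python
import pandas as pd

# Sample data copied from the table above
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})

# Largest category share per column, in percent
# (value_counts ignores missing values by default)
top_share = df.apply(lambda col: col.value_counts(normalize=True).max() * 100)
print(top_share)
```

With a 51% cutoff, only Var1 has a category above the threshold, so it is the only column that should be dropped.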
    def dist_drop(df,cut):
        vars = df.columns
        var = []
        for i in df.columns:
            val_p = df[i].value_counts(normalize = True)*100
            a = val_p.max()
            if a[i] <= cut:
                var.append = df[var]
                new_df = df[var]
                return new_df
Upvotes: 0
Views: 248
Reputation: 120409
Use `apply` to iterate over your columns and create a boolean mask. Return only columns that match your condition.
    def dist_drop(df, cut):
        mask = df.apply(lambda x: x.value_counts(normalize=True)
                                   .mul(100)
                                   .le(cut)
                                   .all())
        return df.loc[:, mask]

    >>> dist_drop(df, 51)
      Var2  Var3
    0    a  Lite
    1    b  Full
    2    c  Full
    3    c  Lite
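To see what the mask looks like before it is used for selection, the intermediate step can be run on its own. This is a minimal sketch that assumes the sample data from the question; the names `cut`, `mask`, and `kept` are only illustrative.

```python
import pandas as pd

# Sample data from the question (rebuilt here so the snippet is self-contained)
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})

cut = 51

# One boolean per column: True when every category's share is <= cut percent
mask = df.apply(lambda x: x.value_counts(normalize=True).mul(100).le(cut).all())
print(mask)

# Boolean column selection keeps only the columns flagged True
kept = df.loc[:, mask]
```

The mask is a Series indexed by column name, so `df.loc[:, mask]` aligns it against the columns and drops the ones marked False.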
Upvotes: 3
Reputation: 516
    def dist_drop(df,cut):
        vars = df.columns
        var = []
        for i in df.columns:
            val_p = df[i].value_counts(normalize = True)*100
            a = val_p.max()
            if a[i] <= cut:  # Condition should be just a - the max value
                var.append = df[var]  # Assigning df to append instead of appending column
                new_df = df[var]  # this should be after the for loop is done
                return new_df  # this should be after the for loop is done
I made minor changes to your function; the problems are marked in the comments above. This should serve your purpose:
    def dist_drop(df,cut):
        var = []
        for i in df.columns:
            val_p = df[i].value_counts(normalize = True)*100
            max_id = val_p.idxmax()
            max_val = val_p[max_id]
            if max_val <= cut:
                var.append(i)
        new_df = df[var]
        return new_df
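Since only the largest share matters, the same check can probably be written more compactly with `max()` instead of `idxmax()` plus a lookup. A rough sketch, assuming the sample data from the question; the function name `dist_drop_max` is hypothetical and used only to keep it apart from the version above.

```python
import pandas as pd

def dist_drop_max(df, cut):
    # Keep a column only if its most frequent category is at most `cut` percent
    keep = [
        col for col in df.columns
        if df[col].value_counts(normalize=True).max() * 100 <= cut
    ]
    return df[keep]

# Sample data from the question
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})

result = dist_drop_max(df, 51)
print(result)
```

Note that `value_counts` skips missing values by default, so the shares are computed over non-null rows only.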
Upvotes: 1