Creating a function to drop categorical variables from a DataFrame based on a threshold

Question

I have a DataFrame in which all columns are categorical variables. I want to remove any column in which a single category's share of the rows exceeds a threshold (cut-off).

Var1	Var2	Var3
Male	a	Lite
Male	b	Full
Male	c	Full
Male	c	Lite

Here, every value in Var1 is Male (a 100% proportion). If I pass 51% as the cut-off/threshold, the function should return the following:

Var2	Var3
a	Lite
b	Full
c	Full
c	Lite

def dist_drop(df,cut):
    vars = df.columns
    var = []
    for i in df.columns:
        val_p = df[i].value_counts(normalize = True)*100
        a = val_p.max()
        if a[i] <= cut:
            var.append = df[var]
            new_df = df[var]
       return new_df

Amith Lakkakula · Accepted Answer

def dist_drop(df,cut):
    vars = df.columns
    var = []
    for i in df.columns:
        val_p = df[i].value_counts(normalize = True)*100
        a = val_p.max()
        if a[i] <= cut: # a is already the max value, so the condition should be just a <= cut
            var.append = df[var] # this assigns to the append method instead of calling var.append(i)
            new_df = df[var] # this should run once, after the for loop is done
       return new_df # this should also come after the for loop, at function level

I made minor changes to your function, as noted in the comments above. The corrected version below should serve your purpose:

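def dist_drop(df, cut):
    var = []  # names of the columns to keep
    for i in df.columns:
        # Share of each category in column i, as a percentage
        val_p = df[i].value_counts(normalize=True) * 100
        max_val = val_p.max()  # share of the most frequent category
        if max_val <= cut:
            var.append(i)
    # Build the result once, after every column has been checked
    new_df = df[var]
    return new_df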
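
To check the function against the example in the question, here is a minimal, self-contained sketch. It repeats the corrected logic so it runs on its own, rebuilds the sample data by hand, and adds a loop-free variant; the name `dist_drop_vectorized` is hypothetical and only for illustration, not something from the original post.

```python
import pandas as pd


def dist_drop(df, cut):
    """Keep columns whose most frequent category is at most `cut` percent of rows."""
    keep = []
    for col in df.columns:
        shares = df[col].value_counts(normalize=True) * 100
        if shares.max() <= cut:
            keep.append(col)
    return df[keep]


def dist_drop_vectorized(df, cut):
    """Hypothetical loop-free variant: same rule, using DataFrame.apply."""
    # One number per column: the percentage share of its most frequent category
    top_share = df.apply(lambda s: s.value_counts(normalize=True).max() * 100)
    # Boolean mask over the columns selects the ones to keep
    return df.loc[:, top_share <= cut]


# Sample data from the question
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})

result = dist_drop(df, 51)
print(list(result.columns))  # ['Var2', 'Var3']
```

Var1 is dropped because Male makes up 100% of its rows, while the largest category in Var2 and in Var3 is 50%, which is within the 51% cut-off. Note that `value_counts` ignores missing values by default, so the shares are computed over the non-null rows only.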
