Amit Kumar

Reputation: 310

Creating a function to drop categorical variables from a DataFrame based on a threshold

I have a DataFrame in which all columns are categorical variables. I want to remove a column if any one of its categories accounts for more than a given threshold (cut-off) share of the rows.

Var1 Var2 Var3
Male a Lite
Male b Full
Male c Full
Male c Lite

Here, every value in Var1 is Male (a 100% proportion). If I pass 51% as my cut-off/threshold, the function should give me the following:

Var2 Var3
a Lite
b Full
c Full
c Lite

This is my attempt so far:

def dist_drop(df,cut):
    vars = df.columns
    var = []
    for i in df.columns:
        val_p = df[i].value_counts(normalize = True)*100
        a = val_p.max()
        if a[i] <= cut:
            var.append = df[var]
            new_df = df[var]
       return new_df

Upvotes: 0

Views: 248

Answers (2)

Corralien

Reputation: 120409

Use apply to iterate over the columns and build a boolean mask that is True when every category's share in that column is at most the cut-off. Then return only the columns where the mask is True.

def dist_drop(df, cut):
    mask = df.apply(lambda x: x.value_counts(normalize=True)
                               .mul(100)
                               .le(cut)
                               .all())
    return df.loc[:, mask]
>>> dist_drop(df, 51)
  Var2  Var3
0    a  Lite
1    b  Full
2    c  Full
3    c  Lite
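
For anyone who wants to reproduce the result, here is a minimal self-contained sketch. It rebuilds the sample data from the question (the DataFrame construction is my assumption about how the data was created) and also prints the share of the most frequent category per column, which is the number the mask is effectively testing. The name top_share is only illustrative.

```python
import pandas as pd

# Sample data from the question (construction assumed for illustration).
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})


def dist_drop(df, cut):
    # True for columns whose category shares (in percent) are all <= cut.
    mask = df.apply(lambda x: x.value_counts(normalize=True)
                               .mul(100)
                               .le(cut)
                               .all())
    return df.loc[:, mask]


# Share of the most frequent category in each column, in percent.
top_share = df.apply(lambda x: x.value_counts(normalize=True).max() * 100)
print(top_share)

result = dist_drop(df, 51)
print(list(result.columns))
```

Checking that all shares are at most the cut-off is equivalent to checking only the largest share, so the mask could also be written as top_share.le(cut).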

Upvotes: 3

Amith Lakkakula

Reputation: 516

def dist_drop(df,cut):
    vars = df.columns
    var = []
    for i in df.columns:
        val_p = df[i].value_counts(normalize = True)*100
        a = val_p.max()
        if a[i] <= cut: # Condition should be just a - the max value
            var.append = df[var] # Assigning df to append instead of appending column
            new_df = df[var] # this should be after the for loop is done
       return new_df # this should be after the for loop is done

I made minor changes to your function; the problems are marked in the comments above. The corrected version below should serve your purpose.

def dist_drop(df,cut):
    var = []
    for i in df.columns:
        val_p = df[i].value_counts(normalize = True)*100
        max_id = val_p.idxmax()
        max_val = val_p[max_id]
        if max_val <= cut:
            var.append(i)
    new_df = df[var]
    return new_df
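
As a quick check, the loop version can be exercised on the sample data from the question. The sketch below is a slightly condensed variant (it calls max() directly instead of idxmax() followed by a lookup, which is my simplification and not part of the function above) and tries a few cut-offs to show how the boundary behaves. The DataFrame construction is assumed for illustration.

```python
import pandas as pd


def dist_drop(df, cut):
    # Keep a column only if its most frequent category covers
    # at most `cut` percent of the non-missing values.
    keep = [col for col in df.columns
            if df[col].value_counts(normalize=True).max() * 100 <= cut]
    return df[keep]


# Sample data from the question (construction assumed for illustration).
df = pd.DataFrame({
    "Var1": ["Male", "Male", "Male", "Male"],
    "Var2": ["a", "b", "c", "c"],
    "Var3": ["Lite", "Full", "Full", "Lite"],
})

# Var1 is 100% "Male", so it is dropped at a 51% cut-off.
print(list(dist_drop(df, 51).columns))

# The comparison is <=, so a column at exactly 50% survives a cut-off of 50.
print(list(dist_drop(df, 50).columns))

# Below 50, every column has a category above the cut-off and nothing is kept.
print(dist_drop(df, 49).shape)
```

Note that value_counts ignores missing values by default, so the percentages are computed over non-missing entries only.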

Upvotes: 1
