Reputation: 1088
I know that to count each unique value of a column and turn it into a percentage I can use:
df['name_of_the_column'].value_counts(normalize=True)*100
How can I do this for all the columns as a function, and then drop any column where a single unique value accounts for above 95% of all values? Note that the function should also count the NaN values.
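For example, value_counts has a dropna flag that makes NaN count as a value of its own (a minimal sketch of the NaN part):

import pandas as pd
import numpy as np

s = pd.Series([1, 1, 1, np.nan])
# with dropna=False, NaN gets its own row in the counts
print(s.value_counts(normalize=True, dropna=False) * 100)
# 1.0    75.0
# NaN    25.0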
Upvotes: 2
Views: 5694
Reputation: 13407
You can write a small wrapper around value_counts that returns False if any single value's share is at or above some threshold, and True if the counts look good:
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({
"A": [1] * 20, # should NOT survive
"B": [1, 0] * 10, # should survive
"C": [np.nan] * 20, # should NOT survive
"D": [1,2,3,4] * 5, # should survive
"E": [0] * 18 + [np.nan, np.nan] # should survive
})
print(df.head())
Implementation
def threshold_counts(s, threshold=0.95):
    # dropna=False so NaN is counted as a value too
    counts = s.value_counts(normalize=True, dropna=False)
    # keep the column only if no single value reaches the threshold
    return not (counts >= threshold).any()
# build a boolean mask, one entry per column, and keep only the good ones
column_mask = df.apply(threshold_counts, threshold=0.95)
clean_df = df.loc[:, column_mask]
print(clean_df.head())
   B  D    E
0  1  1  0.0
1  0  2  0.0
2  1  3  0.0
3  0  4  0.0
4  1  1  0.0
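If you prefer to skip the helper, the same mask can be built by checking only each column's largest share (a variant sketch with the same 0.95 cutoff, not from the answer above):

# a column survives if its most frequent value (NaN included)
# accounts for less than 95% of the rows
max_share = df.apply(lambda s: s.value_counts(normalize=True, dropna=False).max())
clean_df = df.loc[:, max_share < 0.95]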
Upvotes: 2
Reputation: 10624
You can try this loop, which checks the share of the most frequent value in each column and deletes the column in place when it reaches 95%:

cols = list(df.columns)  # snapshot, since we delete columns while looping
for i in cols:
    # dropna=False so NaN counts too (and an all-NaN column doesn't crash)
    res = df[i].value_counts(normalize=True, dropna=False) * 100
    # value_counts sorts descending, so iloc[0] is the largest share
    if res.iloc[0] >= 95:
        del df[i]
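For example, run against the sample DataFrame from the first answer, the loop drops A and C in place:

# assuming df is the sample DataFrame defined in the first answer
print(df.columns.tolist())
# ['B', 'D', 'E']

# E survives because its most common value (0) covers only 90% of rows
print((df["E"] == 0).mean())  # 0.9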
Upvotes: 3