Reputation: 421
I have got a dataframe like this:
import pandas as pd
data = {
'c1': ['Test1','Test2','NULL','Test3',' ','Test4','Test4','Test1',"Test3"],
'c2': [' ','Test1',' ','NULL',' ','NULL','NULL','NULL','NULL'],
'c3': [0,0,0,0,0,1,5,0,0],
'c4': ['NULL', 'Test2', 'Test1','Test1', 'Test2', 'Test2','Test1','Test1','Test2']
}
df = pd.DataFrame(data)
df
The dataframe looks like this:
c1 c2 c3 c4
0 Test1 0 NULL
1 Test2 Test1 0 Test2
2 NULL 0 Test1
3 Test3 NULL 0 Test1
4 0 Test2
5 Test4 NULL 1 Test2
6 Test4 NULL 5 Test1
7 Test1 NULL 0 Test1
8 Test3 NULL 0 Test2
I want to drop all columns, that have more than 60 % of "empty" values. "Empty" means in my case that the values are for example: ' ', 'NULL' or 0. There are strings (c1, c2, c4) as well as integers (c3).
The result should be a dataframe with columns c1 and c4 only.
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
I have no idea how to handle that problem. Only thing that comes to my mind is something like
df.loc[:, (df != 0).any(axis=0)]
to delete all columns where all values are 0, 'NULL' and so on.
Upvotes: 19
Views: 42328
Reputation: 199
Below given solution is very small and fast(in performance)
Step:1 we are finding percentage of null value in every column
Step:2 we are finding column names in list having more than 60% null values
Step:3 Drop columns having more than 60% null values
import pandas as pd
data = {
'c1': ['Test1','Test2','NULL','Test3',' ','Test4','Test4','Test1',"Test3"],
'c2': [' ','Test1',' ','NULL',' ','NULL','NULL','NULL','NULL'],
'c3': [0,0,0,0,0,1,5,0,0],
'c4': ['NULL', 'Test2', 'Test1','Test1', 'Test2', 'Test2','Test1','Test1','Test2']
}
df = pd.DataFrame(data)
# Below code gives percentage of null in every column
null_percentage = df.isnull().sum()/df.shape[0]*100
# Below code gives list of columns having more than 60% null
col_to_drop = null_percentage[null_percentage>60].keys()
output_df = df.drop(col_to_drop, axis=1)
Upvotes: 4
Reputation: 7838
you can drop the columns using dropna thresh
parameter:
In [58]: df = df.replace([0,' ','NULL'],np.nan)
In[59]: df
Out[59]:
c1 c2 c3 c4
0 Test1 NaN NaN NaN
1 Test2 Test1 NaN Test2
2 NaN NaN NaN Test1
3 Test3 NaN NaN Test1
4 NaN NaN NaN Test2
5 Test4 NaN 1.0 Test2
6 Test4 NaN 5.0 Test1
7 Test1 NaN NaN Test1
8 Test3 NaN NaN Test2
In [60]: df.dropna(thresh=df.shape[0]*0.6,how='all',axis=1)
Out[60]:
c1 c4
0 Test1 NaN
1 Test2 Test2
2 NaN Test1
3 Test3 Test1
4 NaN Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
Upvotes: 21
Reputation: 863116
Use DataFrame.isin
for check all formats and then get mean
for treshold and filter by boolean indexing
with loc
:
print (df.isin([' ','NULL',0]))
c1 c2 c3 c4
0 False True True True
1 False False True False
2 True True True False
3 False True True False
4 True True True False
5 False True False False
6 False True False False
7 False True True False
8 False True True False
print (df.isin([' ','NULL',0]).mean())
c1 0.222222
c2 0.888889
c3 0.777778
c4 0.111111
dtype: float64
df = df.loc[:, df.isin([' ','NULL',0]).mean() < .6]
print (df)
c1 c4
0 Test1 NULL
1 Test2 Test2
2 NULL Test1
3 Test3 Test1
4 Test2
5 Test4 Test2
6 Test4 Test1
7 Test1 Test1
8 Test3 Test2
Upvotes: 25