Reputation:
I am trying to delete columns containing a certain percentage of missing values. Below is a working example:
raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'],
'age': [42, '' , '', '', 73],
'sex': ['m', np.nan, 'f', 'm', 'f'],
'preTestScore': [4, np.nan, np.nan, 2, 3],
'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age',
'sex', 'preTestScore', 'postTestScore'])
df
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42 m 4.0 25.0
1 NaN NaN NaN NaN NaN
2 Tina Ali f NaN NaN
3 Jake Milner m 2.0 62.0
4 Amy Cooze 73 f 3.0 70.0
df = df.dropna(thresh=0.7*len(df), axis=1)
df
first_name last_name age sex
0 Jason Miller 42 m
1 NaN NaN NaN
2 Tina Ali f
3 Jake Milner m
4 Amy Cooze 73 f
How may I drop the 'age' column as well? I have spent hours using drop.na, trying to put in zeros in the empty cells. I just can't figure out as to how to detect the missing cells in the 'age' column.
Upvotes: 2
Views: 2348
Reputation: 151
Another way to drop columns having certain percentage of missing values :
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 30 ])
Upvotes: 0
Reputation: 1
how about this using dropna from pandas:
def drop_columns(data, threshold):
return(data.dropna(axis = 1, thresh = (len(data) * (1-threshold))))
(this is my first time answering so sorry if I'm not nailing the etiquette)
Upvotes: 0
Reputation: 4580
First replace ' '
or (Blanks) with NaN
then use dropna()
df = df.replace({'':np.nan})
df
first_name last_name age sex preTestScore postTestScore
0 Jason Miller 42.0 m 4.0 25.0
1 NaN NaN NaN NaN NaN NaN
2 Tina Ali NaN f NaN NaN
3 Jake Milner NaN m 2.0 62.0
4 Amy Cooze 73.0 f 3.0 70.0
You can use following function to check missing value %
def missing(dff):
print("Missing values in %")
print(round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
missing(df)
Missing values in %
age 60.0
postTestScore 40.0
preTestScore 40.0
sex 20.0
last_name 20.0
first_name 20.0
dtype: float64
Let's say, you want to drop all the columns having 60% or more missing values
df = df.drop(df.loc[:,list((100*(df.isnull().sum()/len(df))>=60))].columns, 1)
first_name last_name sex preTestScore postTestScore
0 Jason Miller m 4.0 25.0
1 NaN NaN NaN NaN NaN
2 Tina Ali f NaN NaN
3 Jake Milner m 2.0 62.0
4 Amy Cooze f 3.0 70.0
Note: Age column (60 % missing values) is dropped.
Upvotes: 2
Reputation: 323316
You need replace
, then dropna
df=df.replace({'':np.nan})
df = df.dropna(thresh=0.7*len(df), axis=1)
df
Out[858]:
first_name last_name sex
0 Jason Miller m
1 NaN NaN NaN
2 Tina Ali f
3 Jake Milner m
4 Amy Cooze f
Upvotes: 4