user6416338
user6416338

Reputation:

Python Dropping columns containing large percentage of missing values

I am trying to delete columns containing a certain percentage of missing values. Below is a working example:

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
    'last_name': ['Miller', np.nan, 'Ali', 'Milner', 'Cooze'], 
    'age': [42, '' , '', '', 73], 
    'sex': ['m', np.nan, 'f', 'm', 'f'], 
    'preTestScore': [4, np.nan, np.nan, 2, 3],
    'postTestScore': [25, np.nan, np.nan, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 
    'sex', 'preTestScore', 'postTestScore'])
df
 first_name last_name   age sex preTestScore    postTestScore
 0  Jason   Miller       42  m    4.0             25.0
 1  NaN     NaN              NaN  NaN             NaN
 2  Tina    Ali              f    NaN             NaN
 3  Jake    Milner           m    2.0             62.0
 4  Amy     Cooze       73   f    3.0             70.0

df = df.dropna(thresh=0.7*len(df), axis=1)
df
first_name  last_name   age sex
0   Jason   Miller      42  m
1   NaN     NaN             NaN
2   Tina    Ali             f
3   Jake    Milner          m
4   Amy     Cooze       73  f

How may I drop the 'age' column as well? I have spent hours using drop.na, trying to put in zeros in the empty cells. I just can't figure out as to how to detect the missing cells in the 'age' column.

Upvotes: 2

Views: 2348

Answers (4)

Imam_AI
Imam_AI

Reputation: 151

Another way to drop columns having certain percentage of missing values :

df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 30 ]) 

Upvotes: 0

frankosavela
frankosavela

Reputation: 1

how about this using dropna from pandas:

def drop_columns(data, threshold):
    return(data.dropna(axis = 1, thresh = (len(data) * (1-threshold))))

(this is my first time answering so sorry if I'm not nailing the etiquette)

Upvotes: 0

Suhas_Pote
Suhas_Pote

Reputation: 4580

First replace ' ' or (Blanks) with NaN then use dropna()

df = df.replace({'':np.nan})
df

      first_name last_name   age  sex  preTestScore  postTestScore
0      Jason    Miller  42.0    m           4.0           25.0
1        NaN       NaN   NaN  NaN           NaN            NaN
2       Tina       Ali   NaN    f           NaN            NaN
3       Jake    Milner   NaN    m           2.0           62.0
4        Amy     Cooze  73.0    f           3.0           70.0

You can use following function to check missing value %

def missing(dff):
    print("Missing values in %")
    print(round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))
    
missing(df)

Missing values in %
age              60.0
postTestScore    40.0
preTestScore     40.0
sex              20.0
last_name        20.0
first_name       20.0
dtype: float64

Let's say, you want to drop all the columns having 60% or more missing values

df = df.drop(df.loc[:,list((100*(df.isnull().sum()/len(df))>=60))].columns, 1)

  first_name last_name  sex  preTestScore  postTestScore
0      Jason    Miller    m           4.0           25.0
1        NaN       NaN  NaN           NaN            NaN
2       Tina       Ali    f           NaN            NaN
3       Jake    Milner    m           2.0           62.0
4        Amy     Cooze    f           3.0           70.0

Note: Age column (60 % missing values) is dropped.

Upvotes: 2

BENY
BENY

Reputation: 323316

You need replace, then dropna

df=df.replace({'':np.nan})
df = df.dropna(thresh=0.7*len(df), axis=1)
df
Out[858]: 
  first_name last_name  sex
0      Jason    Miller    m
1        NaN       NaN  NaN
2       Tina       Ali    f
3       Jake    Milner    m
4        Amy     Cooze    f

Upvotes: 4

Related Questions