pythonpandasdataframescikit-learnmissing-data

Reputation: 6475

How to remove columns with too many missing values in Python

I'm working on a machine learning problem in which there are many missing values in the features. There are 100's of features and I would like to remove those features that have too many missing values (it can be features with more than 80% missing values). How can I do that in Python?

My data is a Pandas dataframe.

Upvotes: 12

Answers (9)

ricecooker

Reputation: 91

One thing about dropna() according to the documentation: the thresh argument specifies the number of non-NaNs to keep.

Upvotes: 0

mannem srinivas

Reputation: 121

def show_null_columns(data, agg, threshold):
    if agg == 'sum':
       null_cols = data.isnull().sum()
    elif agg == 'mean':
       null_cols = data.isnull().mean()
    columns = data.columns
    null_dic = {}
    for col,x in zip(columns, null_cols):
        if x>= threshold:
            null_dic[col] = x
    return null_dic

null_dic = show_null_columns(train, 'mean', 0.8)
train2 = train.drop(null_dic.keys(), axis=1)

Upvotes: 1

Rajat Rai

Reputation: 1

Use:

df = df[df.isnull().sum(axis=1) <= 5]

Here we remove the missing values from the rows having greater than five missing values.

Upvotes: 0

BP34500

Reputation: 178

The fastest way to find the sum of NaN or the percentage by columns is:

for the sum: df.isna().sum()
for the percentage: df.isna().mean()

Upvotes: 2

Suhas_Pote

Reputation: 4580

Here is a simple function which you can use directly by passing a dataframe and a threshold

def rmissingvaluecol(dff, threshold):
    l = []
    l = list(dff.drop(dff.loc[:,list((100*(dff.isnull().sum()/len(dff.index)) >= threshold))].columns, 1).columns.values)
    print("# Columns having more than %s percent missing values: "%threshold, (dff.shape[1] - len(l)))
    print("Columns:\n", list(set(list((dff.columns.values))) - set(l)))
    return l


rmissingvaluecol(df,80) # Here threshold is 80% which means we are going to drop columns having more than 80% of missing values

# Output
'''
# Columns having more than 60 percent missing values: 2
Columns:
 ['id', 'location']
'''

Now create a new dataframe excluding these columns:

l = rmissingvaluecol(df, 49)
df1 = df[l]

Bonus step

You can find the percentage of missing values for each column (optional)

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))

missing(df)

# Output
'''
id          83.33
location    83.33
owner       16.67
pets        16.67
dtype: float64
'''

Upvotes: 4

singmotor

Reputation: 4180

You can use Pandas' dropna().

limitPer = len(yourdf) * .80
yourdf = yourdf.dropna(thresh=limitPer, axis=1)

Upvotes: 20

Anton vBR

Reputation: 18916

Following MaxU's example, this is the option for filtering rows:

    df = pd.DataFrame(np.random.choice([2,np.nan], (5,10), p=[0.2, 0.8]), columns=list('abcdefghij'))

        a    b    c    d    e    f    g    h    i    j
    0   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    3   NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  2.0
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Rows

    df.loc[df.isnull().mean(axis=1).lt(0.8)]

        a    b    c    d    e    f    g    h    i    j
    1   NaN  2.0  NaN  2.0  NaN  NaN  2.0  NaN  NaN  2.0
    2   NaN  NaN  2.0  NaN  2.0  NaN  2.0  2.0  NaN  NaN
    4   2.0  2.0  2.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN

Upvotes: 5

vielkind

Reputation: 2980

To generalize within Pandas you can do the following to calculate the percent of values in a column with missing values. From those columns you can filter out the features with more than 80% NULL values and then drop those columns from the DataFrame.

pct_null = df.isnull().sum() / len(df)
missing_features = pct_null[pct_null > 0.80].index
df.drop(missing_features, axis=1, inplace=True)

Upvotes: 4

MaxU - stand with Ukraine

Reputation: 210872

Demo:

Setup:

In [105]: df = pd.DataFrame(np.random.choice([2,np.nan], (20, 5), p=[0.2, 0.8]), columns=list('abcde'))

In [106]: df
Out[106]:
      a    b    c    d    e
0   NaN  2.0  NaN  NaN  NaN
1   NaN  NaN  2.0  NaN  2.0
2   NaN  2.0  NaN  NaN  NaN
3   NaN  NaN  NaN  NaN  2.0
4   NaN  2.0  2.0  NaN  NaN
5   NaN  NaN  NaN  NaN  NaN
6   NaN  2.0  NaN  NaN  NaN
7   2.0  2.0  NaN  NaN  NaN
8   2.0  2.0  NaN  NaN  NaN
9   NaN  NaN  NaN  NaN  NaN
10  NaN  2.0  2.0  NaN  2.0
11  NaN  NaN  NaN  2.0  NaN
12  2.0  NaN  NaN  2.0  NaN
13  NaN  NaN  NaN  2.0  NaN
14  NaN  NaN  NaN  2.0  2.0
15  NaN  NaN  NaN  NaN  NaN
16  NaN  2.0  NaN  NaN  NaN
17  2.0  NaN  NaN  NaN  2.0
18  NaN  NaN  NaN  2.0  NaN
19  NaN  2.0  NaN  2.0  NaN

In [107]: df.isnull().mean()
Out[107]:
a    0.80
b    0.55
c    0.85
d    0.70
e    0.75
dtype: float64

Solution:

In [108]: df.columns[df.isnull().mean() < 0.8]
Out[108]: Index(['b', 'd', 'e'], dtype='object')

In [109]: df[df.columns[df.isnull().mean() < 0.8]]
Out[109]:
      b    d    e
0   2.0  NaN  NaN
1   NaN  NaN  2.0
2   2.0  NaN  NaN
3   NaN  NaN  2.0
4   2.0  NaN  NaN
5   NaN  NaN  NaN
6   2.0  NaN  NaN
7   2.0  NaN  NaN
8   2.0  NaN  NaN
9   NaN  NaN  NaN
10  2.0  NaN  2.0
11  NaN  2.0  NaN
12  NaN  2.0  NaN
13  NaN  2.0  NaN
14  NaN  2.0  2.0
15  NaN  NaN  NaN
16  2.0  NaN  NaN
17  NaN  NaN  2.0
18  NaN  2.0  NaN
19  2.0  2.0  NaN

Upvotes: 29

How to remove columns with too many missing values in Python

Answers (9)

Here is a simple function which you can use directly by passing a dataframe and a threshold

Bonus step

Rows

Related Questions