JoeBoggs

Reputation: 301

sklearn - how to incorporate missing data when one-hot encoding

I'm trying to keep rows in a dataset that contain missing data.

When one-hot encoding a column (or multiple columns) with sklearn, is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array is set to all 0s?

e.g.

A A B -> [[1, 0], [1, 0], [0,1]]

B B A -> [[0, 1], [0, 1], [1,0]]

null B A -> [[0, 0], [0, 1], [1,0]]


one-hot encoding:

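import numpy as np
from sklearn.preprocessing import LabelEncoder
# to_categorical is a Keras utility; the original snippet omitted this import (assumed source)
from keras.utils import to_categorical

dataset = np.loadtxt("someFile.csv", delimiter=",")
B = dataset[:,1]  # second column

# map each distinct value to an integer label
encoder = LabelEncoder()
encoder.fit(B)
encoded_B = encoder.transform(B)

# expand the integer labels into one-hot rows
Y = to_categorical(encoded_B)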

EDIT - Example dataset, where A-E are inputs and X & Y are outputs:

A     B     C     D     E     X      Y
7     6     3     3     2     11     4
5     6     0     0     7     15     7
3     3     9     null  7     12     7
7     null  7     null  7     12     13
null  7     4     6     12    13     4
null  5     7     6     null  14     7
2     6     0     0     2     13     3
7     null  7     null  2     13     7

Upvotes: 8

Views: 16698

Answers (3)

Awais Mushtaq

Reputation: 1112

I would suggest replacing NaN values with the string 'None', which introduces an additional column for the missing entries, i.e. df_encoding_variables = df_encoding_variables.replace(np.nan, 'None')
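A minimal sketch of this suggestion, assuming pandas is used for the encoding step. The column name col and the sample values are made up for illustration; only the replace call comes from the answer.

```python
import numpy as np
import pandas as pd

# Hypothetical single-column frame with one missing entry.
df_encoding_variables = pd.DataFrame({"col": ["A", "A", np.nan, "B"]})

# Replace NaN with the string 'None' so it is treated as its own category.
df_encoding_variables = df_encoding_variables.replace(np.nan, "None")

# dtype=int keeps the output as 0/1 instead of booleans.
dummies = pd.get_dummies(df_encoding_variables, dtype=int)
print(dummies)
```

Note that the missing row is flagged in a separate col_None column rather than encoded as all zeros. Dropping that column afterwards, e.g. dummies.drop(columns="col_None"), should leave the all-zero row the question asks for.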

Upvotes: 0

Kris Stern

Reputation: 1340

Alternatively, you can use the pandas fillna() method (source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). Let's say you have a DataFrame called df. Then you can do:

df = df.fillna(0)

to convert all NaN in df into zeros, before passing it through one-hot encoding.
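One way this could be combined with sklearn is sketched below. It is not part of the answer above: it assumes the valid categories are known in advance and that 0 is not one of them, and it relies on OneHotEncoder's handle_unknown="ignore" option, which encodes values outside the given categories as all zeros. The data is column B from the question's example dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Column B from the question's example dataset, with null read as NaN.
df = pd.DataFrame({"B": [6, 6, 3, np.nan, 7, 5, 6, np.nan]})

# Fill missing values with 0, as suggested above.
filled = df.fillna(0)

# Assumption: the valid categories are listed explicitly and exclude 0.
# With handle_unknown="ignore", any value outside the list (here the
# filled-in 0) is encoded as a row of zeros instead of raising an error.
encoder = OneHotEncoder(categories=[[3.0, 5.0, 6.0, 7.0]], handle_unknown="ignore")
encoded = encoder.fit_transform(filled).toarray()
print(encoded)
```

Without the explicit categories list, the encoder would learn 0 as a category of its own and give it a dedicated column.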

Upvotes: 0

cs95

Reputation: 402443

If you have pandas, this is pretty simple.

import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s

0      A
1      A
2      0
3      B
4      0
5      A
6    NaN
dtype: object

Use replace to convert 0 to NaN -

s = s.replace({0 : np.nan, '0' : np.nan})
s

0      A
1      A
2    NaN
3      B
4    NaN
5      A
6    NaN
dtype: object

Now, call pd.get_dummies, which ignores NaN values.

pd.get_dummies(s, dtype=int)  # dtype=int gives 0/1; pandas 2.0+ returns booleans by default

   A  B
0  1  0
1  1  0
2  0  0
3  0  1
4  0  0
5  1  0
6  0  0

The solution is the same for a dataframe.
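As a rough illustration of the DataFrame case, here is the same replace-then-get_dummies sequence applied to several columns at once. The column names c1 to c3 and the values are hypothetical, loosely based on the A/B rows in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: c1 contains one NaN and one 0 that should both
# end up as all-zero encodings.
df = pd.DataFrame({
    "c1": ["A", "B", np.nan, 0],
    "c2": ["A", "B", "B", "A"],
    "c3": ["B", "A", "A", "B"],
})

# Convert 0 (numeric or string) to NaN, exactly as for the Series.
df = df.replace({0: np.nan, "0": np.nan})

# get_dummies skips NaN, so those cells produce zeros in every c1_* column.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```

Each source column expands into its own group of indicator columns (c1_A, c1_B, and so on), and the rows with missing c1 values carry zeros across the whole c1 group.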

Upvotes: 11
