Reputation: 301
I'm trying to keep rows that contain missing data when one-hot encoding a column (or multiple columns) with sklearn. Is it possible to write a rule so that if currentItem == null or currentItem == 0, the output array for that item is set to all 0s?
e.g.
A A B
-> [[1, 0], [1, 0], [0,1]]
B B A
-> [[0, 1], [0, 1], [1,0]]
null B A
-> [[0, 0], [0, 1], [1,0]]
one-hot encoding:
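import numpy as np
from keras.utils import to_categorical  # assumed source of to_categorical; the import was missing
from sklearn.preprocessing import LabelEncoder

dataset = np.loadtxt("someFile.csv", delimiter=",")
B = dataset[:, 1]  # second column of the file

# Map each distinct value to an integer label
encoder = LabelEncoder()
encoder.fit(B)
encoded_B = encoder.transform(B)

# Expand the integer labels into one-hot rows
Y = to_categorical(encoded_B)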
EDIT - Example dataset, where A-E are inputs and X & Y are outputs:
A B C D E X Y
7 6 3 3 2 11 4
5 6 0 0 7 15 7
3 3 9 null 7 12 7
7 null 7 null 7 12 13
null 7 4 6 12 13 4
null 5 7 6 null 14 7
2 6 0 0 2 13 3
7 null 7 null 2 13 7
Upvotes: 8
Views: 16698
Reputation: 1112
I would suggest replacing NaN values with the string 'None', which introduces an additional column for the missing values, i.e.
df_encoding_variables = df_encoding_variables.replace(np.nan, 'None')
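A minimal sketch of what that replacement does to the encoded output. The column name "B" and its values are made up for illustration, and pd.get_dummies stands in for whichever encoder you actually use:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value (not from the original post).
df_encoding_variables = pd.DataFrame({"B": ["A", "B", np.nan, "A"]})

# Replace NaN with the placeholder string 'None', as suggested above.
filled = df_encoding_variables.replace(np.nan, "None")

# The placeholder is now an ordinary category, so it gets its own column.
dummies = pd.get_dummies(filled["B"], dtype=int)
print(dummies)
```

Note that the missing row ends up with a 1 in the extra 'None' column rather than as an all-zero row.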
Upvotes: 0
Reputation: 1340
Or, you can try the pandas fillna() method (source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html). Let's say you have a DataFrame called df. Then, you can do:
df = df.fillna(0)
to convert all NaN values in df into zeros before passing it through one-hot encoding.
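As a rough sketch of how this plays out, here is column D from the question's example dataset, with null read as NaN. Using pd.get_dummies as the encoder and dropping the 0 column afterwards are my assumptions, not part of the suggestion above:

```python
import numpy as np
import pandas as pd

# Column D of the question's example dataset, with null read as NaN.
df = pd.DataFrame({"D": [3, 0, np.nan, np.nan, 6, 6, 0, np.nan]})

# Missing values become 0 and are now indistinguishable from real zeros.
df = df.fillna(0)

# 0 is encoded like any other value, so it gets its own indicator column.
encoded = pd.get_dummies(df["D"], dtype=int)

# Assumption: dropping that column leaves all-zero rows for 0 and missing.
encoded = encoded.drop(columns=[0.0])
print(encoded)
```

Without the final drop, rows that were missing or 0 carry a 1 in the column for 0 instead of being all zeros.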
Upvotes: 0
Reputation: 402443
If you have pandas, this is pretty simple.
import numpy as np
import pandas as pd

s = pd.Series(['A', 'A', 0, 'B', 0, 'A', np.nan])
s
0 A
1 A
2 0
3 B
4 0
5 A
6 NaN
dtype: object
Use replace to convert 0 to NaN:
s = s.replace({0 : np.nan, '0' : np.nan})
s
0 A
1 A
2 NaN
3 B
4 NaN
5 A
6 NaN
dtype: object
Now, call pd.get_dummies, which ignores NaN values.
pd.get_dummies(s)
A B
0 1 0
1 1 0
2 0 0
3 0 1
4 0 0
5 1 0
6 0 0
The solution is the same for a dataframe.
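For instance, a sketch using the three rows from the question (A A B, B B A, null B A). The column names c1 to c3 are hypothetical, and dtype=int is passed only so the output prints as 0/1:

```python
import numpy as np
import pandas as pd

# The question's three example rows; c1..c3 are made-up column names.
df = pd.DataFrame(
    [["A", "A", "B"], ["B", "B", "A"], [None, "B", "A"]],
    columns=["c1", "c2", "c3"],
)

# Treat 0 (numeric or string) as missing, as in the Series example.
df = df.replace({0: np.nan, "0": np.nan})

# Each column is encoded separately; missing entries yield all-zero groups.
dummies = pd.get_dummies(df, dtype=int)
print(dummies)
```

The last row comes out as [0, 0] for c1, [0, 1] for c2 and [1, 0] for c3, which matches the output asked for in the question.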
Upvotes: 11