Nasri

Reputation: 545

LabelEncoder that keeps missing values as 'NaN'

I am trying to use the LabelEncoder to convert categorical data into numeric values.

I need a LabelEncoder that keeps my missing values as 'NaN' so that I can use an Imputer afterwards. So I would like to use a mask to restore the missing values from the original data frame after labelling, like this:

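import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})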


    A   B   C
0   x   1   2.0
1   NaN 6   1.0
2   z   9   NaN


dfTmp = df
mask = dfTmp.isnull()

       A    B   C
0   False   False   False
1   True    False   False
2   False   False   True

So I get a dataframe of True/False values.

Then I create the encoder:

df = df.astype(str).apply(LabelEncoder().fit_transform)

How can I proceed from here in order to encode these values?

thanks

Upvotes: 8

Views: 9032

Answers (1)

Mikhail Stepanov

Reputation: 3790

The first question is: do you wish to encode each column separately or encode them all with one encoding?

The expression df = df.astype(str).apply(LabelEncoder().fit_transform) implies that you encode all the columns separately.

In that case you can do the following:
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

An explanation of how it works is below. But first, a couple of drawbacks of this solution.

Drawbacks
First, the columns have mixed types: if a column contains a NaN value, the column has type float, because NaNs are floats in Python.

df.dtypes
A    float64
B      int64
C    float64
dtype: object

That seems meaningless for labels. Still, you can later ignore all the NaNs and convert the rest to integers.

Second, you probably need to keep the LabelEncoder around, because it is often required later, for instance for an inverse transform. This solution doesn't keep the encoders: there is no variable that holds them.
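One possible way to do that conversion is pandas' nullable integer dtype. This is a minimal sketch and not part of the original answer: the frame below just re-creates the encoded output shown above, and the name encoded_int is made up for illustration.

```python
import numpy as np
import pandas as pd

# Encoded frame as printed above: NaN forces columns A and C to float64.
encoded = pd.DataFrame({
    'A': [0.0, np.nan, 1.0],
    'B': [0, 1, 2],
    'C': [1.0, 0.0, np.nan],
})

# 'Int64' (capital I) is pandas' nullable integer dtype:
# labels become integers and missing entries become <NA>.
encoded_int = encoded.astype('Int64')
print(encoded_int.dtypes)
print(encoded_int)
```

If a later step (an imputer, for example) expects plain float NaN, it may be simpler to keep the float columns and convert only at the very end.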

A simple, explicit solution is:

encoders = dict()

for col_name in df.columns:
    series = df[col_name]
    label_encoder = LabelEncoder()
    df[col_name] = pd.Series(
        label_encoder.fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    )
    encoders[col_name] = label_encoder

print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  NaN  1  0.0
2  1.0  2  NaN

- more code, but the result is the same

print(encoders)
Out:
{'A': LabelEncoder(), 'B': LabelEncoder(), 'C': LabelEncoder()}

- also, the encoders are available, and so is the inverse transform (drop the NaNs first!):

encoders['B'].inverse_transform(df['B'])
Out:
array([1, 6, 9])
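Column B has no missing values, so the call above works directly. For a column that does contain NaN, a sketch of the "drop the NaNs first" step could look like this. It is an illustration under assumptions: the variable names are invented here, and reindex is just one way to put NaN back in its original position.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

original = pd.Series(['x', np.nan, 'z'], name='A')

# Encode only the non-missing values, keeping their index (same idea as above).
encoder = LabelEncoder()
not_null = original.notnull()
encoded = pd.Series(
    encoder.fit_transform(original[not_null]),
    index=original[not_null].index,
).reindex(original.index)  # NaN reappears at the missing position

# Inverse transform: drop NaN, cast the float labels back to int, decode,
# then realign to the full index so the missing row is NaN again.
present = encoded.dropna().astype(int)
decoded = pd.Series(
    encoder.inverse_transform(present.to_numpy()),
    index=present.index,
).reindex(encoded.index)
print(decoded.tolist())
```

The cast to int matters because the encoded column is float once it holds a NaN, while the encoder works with integer labels.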

Other options, such as a registry superclass that holds the encoders, are also possible and are compatible with the first solution, but a plain loop over the columns is easier.

How it works

df.apply(lambda series: ...) applies a function that returns a pd.Series to each column, so it returns a dataframe with the new values.

Expression step by step:

pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
    index=series[series.notnull()].index
)

- series[series.notnull()] drops the NaN values, then feeds the rest to fit_transform.

- as the label encoder returns a numpy.array and throws away the index, index=series[series.notnull()].index restores it so that the values are aligned with the right rows. Without the index:

print(df)
Out:
     A  B    C
0    x  1  2.0
1  NaN  6  1.0
2    z  9  NaN
df = df.apply(lambda series: pd.Series(
    LabelEncoder().fit_transform(series[series.notnull()]),
))
print(df)
Out:
     A  B    C
0  0.0  0  1.0
1  1.0  1  0.0
2  NaN  2  NaN

- the values shift from their correct positions, and an IndexError may even occur.

Single encoder for all columns

In that case, stack the dataframe, fit the encoder, then unstack it:

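# dropna() makes the removal of NaN cells explicit instead of relying on stack()'s default
series_stack = df.stack().dropna().astype(str)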
label_encoder = LabelEncoder()
df = pd.Series(
    label_encoder.fit_transform(series_stack),
    index=series_stack.index
).unstack()
print(df)
Out:
     A    B    C
0  5.0  0.0  2.0
1  NaN  3.0  1.0
2  6.0  4.0  NaN

- stacking leaves the NaN cells out, so after unstack() they come back as NaN; as a result all values in the DataFrame are floats, and you may prefer to convert them.
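Decoding with the shared encoder can follow the same stack / unstack pattern in reverse. The sketch below is an assumption-laden illustration rather than part of the original answer: the names stacked, encoded, codes and decoded are invented, and dropna() is called explicitly so the result does not depend on whether stack() removes NaN on its own.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

# Encode every non-missing cell with one shared encoder.
stacked = df.stack().dropna().astype(str)
encoder = LabelEncoder()
encoded = pd.Series(
    encoder.fit_transform(stacked),
    index=stacked.index,
).unstack()

# Decode: stack again, drop NaN, cast the float labels to int, invert, unstack.
codes = encoded.stack().dropna().astype(int)
decoded = pd.Series(
    encoder.inverse_transform(codes.to_numpy()),
    index=codes.index,
).unstack()
print(decoded)
```

Note that the decoded cells are strings, because everything was cast with astype(str) before fitting; the original numeric types of B and C are not restored.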

Hope it helps.

Upvotes: 14
