tegraze

Reputation: 25

Sklearn Labelencoder keep encoded values when encoding new dataframe

I'm writing a script that uses the Local Outlier Factor algorithm for novelty detection. In this mode, the model must be fit on a clean training dataframe before it can make predictions. The algorithm needs numeric input, so the values in the dataframe have to be encoded first, for example 'vrrp' to '0' and 'udp' to '2'. For this I use sklearn's LabelEncoder(), which lets me pass the encoded dataframe to the algorithm.

encoder = LabelEncoder()
dataEnc = dataEnc.apply(encoder.fit_transform)

...

dataframeEnc = dataframeEnc.apply(encoder.fit_transform)

Where 'dataEnc' is the training dataset and 'dataframeEnc' is the dataset for making the predictions.
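
For context, the fit-then-predict workflow described above can be sketched roughly as follows with scikit-learn's LocalOutlierFactor in novelty mode. The toy numeric data and the n_neighbors value are assumptions for illustration, not the real traffic data.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy, already-numeric training data: a 5x5 grid of points (an assumption).
train = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)

# novelty=True enables predict() on data that was not seen during fit().
lof = LocalOutlierFactor(n_neighbors=5, novelty=True)
lof.fit(train)

# predict() returns 1 for inliers and -1 for outliers.
new = np.array([[2.1, 1.9], [50.0, 50.0]])
predictions = lof.predict(new)
print(predictions)
```

Because the model compares new rows against the fitted rows by distance, a given original value has to map to the same number in both frames.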

The problem arises when I make predictions with a new dataframe: the same original value is encoded differently in the 'training' dataframe and in the 'predict' dataframe.

My objective is to keep the mapping from original values to encoded values unchanged when encoding a new dataframe.

In the training dataframe, the value '10.67.21.254', for example, always encodes to '23'. However, when encoding a new dataframe (the validation dataframe), the same value gets a different code, in my case '1'.

As an example, this row:

10.67.21.254       234.1.2.88      0      0     udp  3.472 KB       62

Which encodes to this:

23     153      0      0         4  1254       61          0

I expected the same original values to encode to the same values. However, what I get after encoding the row again as part of the new dataframe is:

1       1      0      0         1     2        2          0

I believe it assigns new codes to the values in each dataset based only on the other values in that same dataset.
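
The effect can be reproduced with a minimal sketch. The addresses below are made up for illustration. LabelEncoder sorts the unique values it is given and numbers them from 0, and every call to fit_transform relearns that mapping from scratch, even on the same encoder object.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, not taken from the real dataset.
train_ips = ["10.67.21.1", "10.67.21.254", "10.67.21.7"]
new_ips = ["10.67.21.254"]

encoder = LabelEncoder()

# First call: the mapping is learned from the training values.
train_codes = encoder.fit_transform(train_ips)

# Second call: the mapping is relearned from the new values only,
# so the training mapping is overwritten.
new_codes = encoder.fit_transform(new_ips)

code_in_train = train_codes[train_ips.index("10.67.21.254")]
code_in_new = new_codes[new_ips.index("10.67.21.254")]
print(code_in_train, code_in_new)  # 1 in the training data, 0 in the new data
```

The same applies to `dataEnc.apply(encoder.fit_transform)`: the encoder is refit for every column, so after the call it only holds the mapping of the last column.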

My question then is: how can I make sure that, when encoding the values of the new (predict) dataset, I get the same encoded values as in the previous (training) dataset?

Upvotes: 1

Views: 944

Answers (1)

KRKirov

Reputation: 4004

The custom transformer below should help. It encodes one column at a time, so to transform the whole data frame you would have to loop over the columns and keep a dictionary of encoders, one per column.

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class TTLabelEncoder(BaseEstimator, TransformerMixin):
    """Transform data frame columns with different categorical values
    in training and test data. TT stands for Train-Test

    Pass individual data frame columns to the class instance"""

    def __init__(self):
        self.code_dict = None
        self.max_code = None
        self.fitted = False

    def fit(self, df):
        self.code_dict = dict(zip(df.unique(),
                                  np.arange(len(df.unique()))))
        self.__max_code__()
        self.fitted = True
        return self

    def transform(self, df):
        assert self.fitted, 'Fit the data before transforming.'
        new_cat = set(df.unique()).difference(set(self.code_dict.keys()))
        if new_cat:
            new_codes = dict(zip(new_cat, 
                     np.arange(len(new_cat)) + self.max_code + 1))
            self.code_dict.update(new_codes)
            self.__max_code__()
        return df.map(self.code_dict)

    def __max_code__(self):
        self.max_code = max(self.code_dict.values())
        return self

    def fit_transform(self, df):
        if not self.fitted:
            self.fit(df)
        df = self.transform(df)
        return df

df_1 = pd.DataFrame({'IP': np.random.choice(list('ABCD'), size=5),
                   'Counts': np.random.randint(10, 20, size=5)})

df_2 = pd.DataFrame({'IP': np.random.choice(list('DEF'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})

df_3 = pd.DataFrame({'IP': np.random.choice(list('XYZ'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})

ip_encoder = TTLabelEncoder()
ip_encoder.fit(df_1['IP'])
ip_encoder.code_dict

df_1['IP'] = ip_encoder.transform(df_1['IP'])
df_2['IP'] = ip_encoder.transform(df_2['IP'])
df_3['IP'] = ip_encoder.fit_transform(df_3['IP'])

Output:

 df_1 #Before transformation
Out[54]: 
  IP  Counts
0  D      11
1  C      16
2  B      14
3  A      15
4  D      14

df_1 #After transformation
Out[58]: 
   IP  Counts
0   0      11
1   1      16
2   2      14
3   3      15
4   0      14

df_2 #Before transformation
Out[62]: 
  IP  Counts
0  F      15
1  D      10
2  E      19
3  F      18
4  F      14

df_2 #After transformation
Out[64]: 
   IP  Counts
0   4      15
1   0      10
2   5      19
3   4      18
4   4      14

df_3 #Before transformation
Out[66]: 
  IP  Counts
0  X      19
1  Z      18
2  X      12
3  X      13
4  Y      18

df_3 #After transformation
Out[68]: 
   IP  Counts
0   7      19
1   6      18
2   7      12
3   7      13
4   8      18

ip_encoder.code_dict
Out[69]: {'D': 0, 'C': 1, 'B': 2, 'A': 3, 'F': 4, 'E': 5, 'Z': 6, 'X': 7, 'Y': 8}
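
The loop over columns with a dictionary of encoders, mentioned at the top, could look roughly like the sketch below. To keep the block runnable on its own, plain dictionaries stand in for the TTLabelEncoder instances. `fit_encoders` and `encode` are hypothetical helper names, and the sample frames are made up.

```python
import pandas as pd


def fit_encoders(df, columns):
    """Build one value -> code dictionary per column from the training frame."""
    return {col: {value: code for code, value in enumerate(df[col].unique())}
            for col in columns}


def encode(df, encoders):
    """Encode df with the stored dictionaries, extending them for unseen values."""
    out = df.copy()
    for col, mapping in encoders.items():
        for value in out[col].unique():
            # Existing values keep their code; unseen values get the next free one.
            mapping.setdefault(value, len(mapping))
        out[col] = out[col].map(mapping)
    return out


# Hypothetical sample data for illustration only.
train = pd.DataFrame({"src": ["10.67.21.254", "10.67.21.1"],
                      "proto": ["udp", "vrrp"]})
new = pd.DataFrame({"src": ["10.67.21.9", "10.67.21.254"],
                    "proto": ["udp", "tcp"]})

encoders = fit_encoders(train, ["src", "proto"])
train_enc = encode(train, encoders)
new_enc = encode(new, encoders)
print(new_enc)
```

With the class above, the equivalent dictionary would be built as `{col: TTLabelEncoder().fit(train[col]) for col in columns}` and applied column by column with `transform`. Either way, values that only appear in the new frame receive codes the fitted model has never seen.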

Upvotes: 1
