Reputation: 25
I'm writing a script that uses the Local Outlier Factor algorithm for novelty detection. In this case we need to fit a clean training dataframe before making predictions. For the algorithm to work, the values in the dataframe have to be encoded as numbers, for example 'vrrp' to '0' and 'udp' to '2', and so on. For this purpose I use sklearn's LabelEncoder(), which lets me pass the encoded dataframe to the algorithm.
encoder = LabelEncoder()
dataEnc = dataEnc.apply(encoder.fit_transform)
...
dataframeEnc = dataframeEnc.apply(encoder.fit_transform)
Here 'dataEnc' is the training dataset and 'dataframeEnc' is the dataset used for making the predictions.
The problem arises when I try to make predictions with a new dataframe: for the same original value, the encoded value in the training dataframe is not the same as the encoded value in the prediction dataframe.
My objective is to keep the mapping from original values to encoded values stable when encoding a new dataframe.
In the training dataframe, the value '10.67.21.254', for example, always encodes to '23'. However, in a new (validation) dataframe the same value gets a different code, in my case '1'.
As an example, take this row:
10.67.21.254 234.1.2.88 0 0 udp 3.472 KB 62
Which encodes to this:
23 153 0 0 4 1254 61 0
I expect the same original values to always encode to the same codes. However, what I get after encoding it again is:
1 1 0 0 1 2 2 0
I believe the encoder is assigning new codes to each row based only on the other values present in the dataset it is currently encoding.
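A minimal reproduction of that behaviour, as a sketch with made-up protocol values (the `train` and `new` series are illustrative, not the real data): LabelEncoder numbers the sorted unique values of whatever it was last fitted on, so refitting on a different dataset can shift the code of the same value.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative data: the training set contains one more category than the new set.
train = pd.Series(["tcp", "udp", "vrrp", "udp"])
new = pd.Series(["udp", "vrrp"])

# Fitting on the training data numbers the sorted classes: tcp=0, udp=1, vrrp=2.
train_encoder = LabelEncoder().fit(train)
code_in_train = int(train_encoder.transform(["udp"])[0])

# Refitting on the new data renumbers from scratch: udp=0, vrrp=1.
new_encoder = LabelEncoder().fit(new)
code_in_new = int(new_encoder.transform(["udp"])[0])

print(code_in_train, code_in_new)  # the same value, two different codes

# Reusing the encoder fitted on the training data keeps the code stable.
code_reused = int(train_encoder.transform(new)[0])
```

Calling `fit_transform` on each dataframe separately therefore discards the mapping learned from the training data every time.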
My question, then, is: how can I make sure that encoding the values of the new (predict) dataset gives the same encoded values as in the previous (training) dataset?
Upvotes: 1
Views: 944
Reputation: 4004
A custom transformer should help. To transform the whole data frame, you would have to loop over its columns and keep a dictionary of encoders, one per column.
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class TTLabelEncoder(BaseEstimator, TransformerMixin):
    """Transform data frame columns with different categorical values
    in training and test data. TT stands for Train-Test.
    Pass individual data frame columns to the class instance."""

    def __init__(self):
        self.code_dict = None
        self.max_code = None
        self.fitted = False

    def fit(self, df):
        # Assign codes 0..n-1 in order of first appearance.
        categories = df.unique()
        self.code_dict = dict(zip(categories, np.arange(len(categories))))
        self.__max_code__()
        self.fitted = True
        return self

    def transform(self, df):
        assert self.fitted, 'Fit the data before transforming.'
        # Unseen categories get fresh codes above the current maximum,
        # so codes assigned earlier never change.
        new_cat = set(df.unique()).difference(self.code_dict.keys())
        if new_cat:
            new_codes = dict(zip(new_cat,
                                 np.arange(len(new_cat)) + self.max_code + 1))
            self.code_dict.update(new_codes)
            self.__max_code__()
        return df.map(self.code_dict)

    def __max_code__(self):
        self.max_code = max(self.code_dict.values())
        return self

    def fit_transform(self, df):
        # Fit only on the first call; later calls reuse the existing mapping.
        if not self.fitted:
            self.fit(df)
        return self.transform(df)
df_1 = pd.DataFrame({'IP': np.random.choice(list('ABCD'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})
df_2 = pd.DataFrame({'IP': np.random.choice(list('DEF'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})
df_3 = pd.DataFrame({'IP': np.random.choice(list('XYZ'), size=5),
                     'Counts': np.random.randint(10, 20, size=5)})
ip_encoder = TTLabelEncoder()
ip_encoder.fit(df_1['IP'])
ip_encoder.code_dict
df_1['IP'] = ip_encoder.transform(df_1['IP'])
df_2['IP'] = ip_encoder.transform(df_2['IP'])
df_3['IP'] = ip_encoder.fit_transform(df_3['IP'])
Output:
df_1 #Before transformation
Out[54]:
IP Counts
0 D 11
1 C 16
2 B 14
3 A 15
4 D 14
df_1 #After transformation
Out[58]:
IP Counts
0 0 11
1 1 16
2 2 14
3 3 15
4 0 14
df_2 #Before transformation
Out[62]:
IP Counts
0 F 15
1 D 10
2 E 19
3 F 18
4 F 14
df_2 #After transformation
Out[64]:
IP Counts
0 4 15
1 0 10
2 5 19
3 4 18
4 4 14
df_3 #Before transformation
Out[66]:
IP Counts
0 X 19
1 Z 18
2 X 12
3 X 13
4 Y 18
df_3 #After transformation
Out[68]:
IP Counts
0 7 19
1 6 18
2 7 12
3 7 13
4 8 18
ip_encoder.code_dict
Out[69]: {'D': 0, 'C': 1, 'B': 2, 'A': 3, 'F': 4, 'E': 5, 'Z': 6, 'X': 7, 'Y': 8}
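The loop with a dictionary of encoders mentioned at the start of this answer could look roughly like the sketch below. To keep it self-contained it uses sklearn's plain LabelEncoder rather than the class above, and the frames `train_df` and `predict_df`, their columns, and the helper `encode` are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and prediction frames with two categorical columns.
train_df = pd.DataFrame({"proto": ["tcp", "udp", "vrrp"],
                         "src": ["a", "b", "a"]})
predict_df = pd.DataFrame({"proto": ["udp", "tcp"],
                           "src": ["b", "a"]})

# One encoder per column, fitted on the training data only.
encoders = {col: LabelEncoder().fit(train_df[col]) for col in train_df.columns}


def encode(df, encoders):
    """Return a copy of df with each column mapped through its fitted encoder."""
    out = df.copy()
    for col, enc in encoders.items():
        # transform() reuses the mapping learned during fit(); it never refits.
        out[col] = enc.transform(df[col])
    return out


train_enc = encode(train_df, encoders)
predict_enc = encode(predict_df, encoders)
print(predict_enc)
```

With a plain LabelEncoder, `transform` raises a ValueError for values that were not present during fitting. Swapping `LabelEncoder()` for `TTLabelEncoder()` in the dictionary is one way to assign fresh codes to such values instead.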
Upvotes: 1