Xavier

Reputation: 257

Getting ValueError: y contains new labels when using scikit learn's LabelEncoder

I have a series like:

df['ID'] = ['ABC123', 'IDF345', ...]

I'm using scikit's LabelEncoder to convert it to numerical values to be fed into the RandomForestClassifier.

During the training, I'm doing as follows:

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID) 

But now, for testing/prediction, when I pass in new data, I want to transform the 'ID' from this data based on le_id, i.e., if the same values are present, transform them according to the above label encoder; otherwise, assign a new numerical value.

In the test file, I was doing as follows:

new_df['ID'] = le_id.transform(new_df.ID)

But, I'm getting the following error: ValueError: y contains new labels

How do I fix this?? Thanks!

UPDATE:

So the task I have is to use the below (for example) as training data and predict the 'High', 'Mod', 'Low' values for new BankNum, ID combinations. The model should learn from the training dataset the characteristics under which a 'High' or a 'Low' is given. For example, below, a 'High' is given when there are multiple entries with the same BankNum and different IDs.

df = 

BankNum   | ID    | Labels

0098-7772 | AB123 | High
0098-7772 | ED245 | High
0098-7772 | ED343 | High
0870-7771 | ED200 | Mod
0870-7771 | ED100 | Mod
0098-2123 | GH564 | Low

And then predict it on something like:

BankNum   | ID

00982222  | AB999
00982222  | AB999
00981111  | AB890

I'm doing something like this:

df['BankNum'] = df.BankNum.astype(np.float128)

le_id = LabelEncoder()
df['ID'] = le_id.fit_transform(df.ID)

X_train, X_test, y_train, y_test = train_test_split(df[['BankNum', 'ID']], df.Labels, test_size=0.25, random_state=42)
clf = RandomForestClassifier(random_state=42, n_estimators=140)
clf.fit(X_train, y_train)

Upvotes: 15

Views: 66638

Answers (10)

Sachin Modi

Reputation: 11

I faced a similar issue. You have to encode each column separately and, technically, save a pickle file for each encoder.

As you are aware, LabelEncoder() works on a single column at a time. Suppose you applied fit_transform to each column in a loop and created a single pickle file.

It would only remember the fit from the last column, not the others.
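
A minimal sketch of that pattern, assuming the 'ID' and 'BankNum' columns from the question and a hypothetical encoders.pkl path:

import pickle
from sklearn.preprocessing import LabelEncoder

# one encoder per categorical column, kept in a dict instead of
# overwriting a single encoder on every loop iteration
encoders = {}
for col in ['ID', 'BankNum']:
    enc = LabelEncoder()
    df[col] = enc.fit_transform(df[col])
    encoders[col] = enc

# pickling the dict preserves the fit of every column, not just the last one
with open('encoders.pkl', 'wb') as f:
    pickle.dump(encoders, f)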

Upvotes: 1

This is in fact a known bug in LabelEncoder: BUG for fit_transform ... basically, you have to fit it and then transform. It will work fine! A suggestion is to keep a dictionary of your encoders, one for each column, so that in the inverse transform you are able to retrieve the original categorical values.
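
A minimal sketch of that suggestion, assuming the 'ID' column from the question:

from sklearn.preprocessing import LabelEncoder

# fit first, then transform, and keep the encoder for each column in a
# dictionary so the original values can be retrieved later
encoders = {}
encoders['ID'] = LabelEncoder().fit(df['ID'])
df['ID'] = encoders['ID'].transform(df['ID'])

# inverse_transform recovers the original categorical values
original = encoders['ID'].inverse_transform(df['ID'])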

Upvotes: 1

anmol

Reputation: 3

This error comes when the transform function encounters a new value that the LabelEncoder is asked to encode but that was not present in the corpus when fit_transform was called on the training samples. So there is a hack: either use all the unique values with the fit_transform function, if you are sure that no new value will come later, or try a different encoding method that suits the problem statement, such as HashingEncoder.

Here is an example for the case where no new values will appear at testing time:

# fit on the union of train and test IDs so transform never sees a new label
le_id.fit(list(set(df['ID'].unique()).union(set(new_df['ID'].unique()))))
new_df['ID'] = le_id.transform(new_df.ID)
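
And if new values can appear, a hashing-based encoder sidesteps the problem entirely, since any value, seen or unseen, hashes into the same fixed set of columns. A minimal sketch, assuming the third-party category_encoders package is installed:

import category_encoders as ce  # pip install category_encoders

# hashing maps any value to a fixed number of output columns, so
# transforming unseen IDs can never raise "y contains new labels"
he = ce.HashingEncoder(cols=['ID'], n_components=8)
df_encoded = he.fit_transform(df)   # fit on the training data
new_encoded = he.transform(new_df)  # unseen IDs hash like any other value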

Upvotes: 0

Black_Hat

Reputation: 11

I found an easy hack around this issue.

Assuming X is the dataframe of features,

  1. First, we need to create a list of pairs in which the first element is an index starting from 0 and the second is the corresponding categorical column name. We easily accomplish this using enumerate.

    cat_cols_enum = list(enumerate(X.select_dtypes(include = ['O']).columns))

  2. Then the idea is to create a list of label encoders whose length is equal to the number of qualitative (categorical) columns present in the dataframe X.

    le = [LabelEncoder() for i in range(len(cat_cols_enum))]

  3. The next and last part is fitting each of the label encoders in the list with the unique values of its corresponding categorical column.

    for i in cat_cols_enum: le[i[0]].fit(X[i[1]].value_counts().index)

Now, we can transform the labels to their respective encodings using

for i in cat_cols_enum:
    X[i[1]] = le[i[0]].transform(X[i[1]])

Upvotes: 1

Samuel Tosan Ayo

Reputation: 379

I hope this helps someone as it's more recent.

sklearn's fit_transform performs the fit function and the transform function directly on the label encoding. To solve the problem of the y label throwing an error for unseen values, use:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()     
le.fit_transform(Col) 

This solves it!

Upvotes: -2

Marco Cerliani

Reputation: 22021

If your data is a pd.DataFrame, I suggest this simple solution...

I built a custom transformer that integer-maps each categorical feature. Once fitted, you can transform all the data you want. It can compute either simple label encoding or one-hot encoding.

If new unseen categories or NaNs are present in new data:

1] For label encoding, 0 is a special token reserved for mapping these cases.

2] For onehot encoding, all the onehot columns are zeros in these cases.

class FeatureTransformer:
    
    def __init__(self, categorical_features):
        self.categorical_features = categorical_features
        
    def fit(self, X):

        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")
            
        if not isinstance(self.categorical_features, list):
            raise ValueError(
                "Pass categorical_features as a list of column names")
                    
        self.encoding = {}
        for c in self.categorical_features:

            _, int_id = X[c].factorize()
            self.encoding[c] = dict(zip(list(int_id), range(1,len(int_id)+1)))
            
        return self

    def transform(self, X, onehot=True):

        if not isinstance(X, pd.DataFrame):
            raise ValueError("Pass a pandas.DataFrame")

        if not hasattr(self, 'encoding'):
            raise AttributeError("FeatureTransformer must be fitted")
            
        df = X.drop(self.categorical_features, axis=1)
        
        if onehot:  # one-hot encoding
            for c in sorted(self.categorical_features):            
                categories = X[c].map(self.encoding[c]).values
                for val in self.encoding[c].values():
                    df["{}_{}".format(c,val)] = (categories == val).astype('int16')
        else:       # label encoding
            for c in sorted(self.categorical_features):
                df[c] = X[c].map(self.encoding[c]).fillna(0)
            
        return df

Usage:

X_train = pd.DataFrame(np.random.randint(10,20, (100,10)))
X_test = pd.DataFrame(np.random.randint(20,30, (100,10)))

ft = FeatureTransformer(categorical_features=[0,1,3])
ft.fit(X_train)

ft.transform(X_test, onehot=False).shape

Upvotes: 2

bshelt141

Reputation: 1223

I'm able to mentally process operations better when dealing in DataFrames. The approach below fits and transforms the LabelEncoder() using the training data, then uses a series of pd.merge joins to map the trained fit/transform encoder values to the test data. When there is a test data value not seen in the training data, the code defaults to the max trained encoder value + 1.

# encode class values as integers
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
encoder.fit(y_train)
encoded_y_train = encoder.transform(y_train)

# make a dataframe of the unique train values and their corresponding encoded integers
y_map = pd.DataFrame({'y_train': y_train, 'encoded_y_train': encoded_y_train})
y_map = y_map.drop_duplicates()

# map the unique test values to the trained encoded integers
y_test_df = pd.DataFrame({'y_test': y_test})
y_test_unique = y_test_df.drop_duplicates()
y_join = pd.merge(y_test_unique, y_map, 
                  left_on = 'y_test', right_on = 'y_train', 
                  how = 'left')

# if the test category is not found in the training category group, then make the 
# value the maximum value of the training group + 1                  
y_join['encoded_y_test'] = np.where(y_join['encoded_y_train'].isnull(), 
                                    y_map.shape[0] + 1, 
                                    y_join['encoded_y_train']).astype('int')

encoded_y_test = pd.merge(y_test_df, y_join, on = 'y_test', how = 'left') \
    .encoded_y_test.values

Upvotes: 0

Arun Ganesan

Reputation: 15

I used

    le.fit_transform(Col)

and I was able to resolve the issue. It does both fit and transform, so we don't need to worry about unknown values in the test split.

Upvotes: -4

Yury Wallet

Reputation: 1650

You can try the solution from "sklearn.LabelEncoder with never seen before values": https://stackoverflow.com/a/48169252/9043549. The idea is to create a dictionary of the classes, then map the column and fill new classes with some "known value".

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
suf = "_le"
col = "a"
df[col + suf] = le.fit_transform(df[col])
# dictionary mapping each class seen in column "a" to its integer code
dic = dict(zip(le.classes_, le.transform(le.classes_)))
col = "b"
# unseen classes in column "b" fall back to the code of a known class ("c" here)
df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)

Upvotes: 4

zimmerrol

Reputation: 4951

I think the error message is very clear: your test dataset contains ID labels which have not been included in your training data set. For these items, the LabelEncoder cannot find a suitable numeric value to represent them. There are a few ways to solve this problem. You can either try to balance your data set, so that you are sure that each label is present not only in your test but also in your training data, or you can try to follow one of the ideas presented here.

One possible solution is to search through your data set at the beginning, get a list of all unique ID values, train the LabelEncoder on this list, and keep the rest of your code just as it is at the moment.

Another possible solution is to check that the test data contain only labels which have been seen during training. If there is a new label, you have to set it to some fallback value like unknown_id (or something like this). Doing this, you put all new, unknown IDs in one class; for these items the prediction will then fail, but you can use the rest of your code as it is now. A sketch of this second idea follows below.
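
A minimal sketch of that fallback approach, assuming the 'ID' column from the question; the 'UNKNOWN' token and the 'ID_enc' column name are illustrative choices:

from sklearn.preprocessing import LabelEncoder

le_id = LabelEncoder()
# fit on the training IDs plus one explicit fallback class
le_id.fit(list(df['ID'].unique()) + ['UNKNOWN'])
df['ID_enc'] = le_id.transform(df['ID'])

# at test time, map every ID the encoder has never seen to the fallback
known = set(le_id.classes_)
new_df['ID_enc'] = le_id.transform(
    new_df['ID'].map(lambda x: x if x in known else 'UNKNOWN'))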

Upvotes: 11
