BackSlash

Reputation: 103

return the labels and their encoded values in sklearn LabelEncoder

I'm using LabelEncoder and OneHotEncoder from sklearn in a machine learning project to encode the labels (country names) in the dataset. Everything works well and my model runs perfectly. The project is to classify whether a bank customer will continue with or leave the bank based on a number of features (data), including the customer's country.

My issue arises when I want to predict (classify) a new customer (one only). The data for the new customer is still not pre-processed (i.e., country names are not encoded). Something like the following:

new_customer = np.array([['France', 600, 'Male', 40, 3, 60000, 2, 1, 1, 50000]])

In the online course where I'm learning machine learning, the instructor opened the pre-processed dataset that included the encoded data, manually looked up the code for France, and updated it in new_customer, as follows:

new_customer = np.array([[0, 0, 600, 'Male', 40, 3, 60000, 2, 1, 1, 50000]])

I believe this is not practical; there must be a way to automatically encode France to the same code used in the original dataset, or at least a way to return a list of the countries and their encoded values. Manually encoding a label seems tedious and error-prone. So how can I automate this process, or generate the codes for the labels? Thanks in advance.

Upvotes: 6

Views: 15564

Answers (4)

Visakh Sr

Reputation: 1

To get the classes back from sklearn.preprocessing.LabelEncoder, use its .classes_ attribute.

from sklearn.preprocessing import LabelEncoder

# Labels
labels = ["background", "objective", "method", "result", "conclusion"]

# Initialize and fit the LabelEncoder
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)

print(encoder.classes_)
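
If you also want the labels together with their encoded values, which is what the question asks for, one small sketch is to pair .classes_ with its own indices (LabelEncoder assigns code i to the i-th entry of the sorted classes_ array):

# build a label -> code mapping from the fitted encoder
mapping = dict(zip(encoder.classes_, range(len(encoder.classes_))))
print(mapping)
# {'background': 0, 'conclusion': 1, 'method': 2, 'objective': 3, 'result': 4}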

Upvotes: 0

Pasindu Perera

Reputation: 587

The problem is that you didn't encode the country attribute of your dataset.

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

# define example
data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot']
values = array(data)
print(values)

# integer encode
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(values)
print(integer_encoded)

# binary encode (note: scikit-learn >= 1.2 renames this argument to sparse_output)
onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
print(onehot_encoded)

Output:

['cold' 'cold' 'warm' 'cold' 'hot' 'hot' 'warm' 'cold' 'warm' 'hot']
[0 0 2 0 1 1 2 0 2 1]
[[1. 0. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

For your problem, this data = ['cold', 'cold', 'warm', 'cold', 'hot', 'hot', 'warm', 'cold', 'warm', 'hot'] should be your dataset's country attribute. Then you can choose the integer or binary (one-hot) encoding method and continue the learning process; a sketch of this applied to countries follows below.
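
Applied to the question's data, a minimal sketch might look like this (the country values here are only illustrative):

from numpy import array
from sklearn.preprocessing import LabelEncoder

# fit the encoder on the country column of the training data
countries = array(['France', 'Spain', 'Germany', 'France'])
label_encoder = LabelEncoder()
label_encoder.fit(countries)

# the same fitted encoder then encodes a new customer's country automatically
print(label_encoder.transform(['France']))  # [0]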

Upvotes: 0

Brad Solomon

Reputation: 40948

It seems like you may be looking for the .transform() method of your estimator.

>>> from sklearn.preprocessing import LabelEncoder

>>> c = ['France', 'UK', 'US', 'US', 'UK', 'China', 'France']
>>> enc = LabelEncoder().fit(c)
>>> encoded = enc.transform(c)
>>> encoded
array([1, 2, 3, 3, 2, 0, 1])

>>> enc.transform(['France'])
array([1])

This takes the "mapping" that was learned when you called fit(c) and applies it to new data (in this case, a new label). You can see this mapping in reverse:

>>> enc.inverse_transform(encoded)
array(['France', 'UK', 'US', 'US', 'UK', 'China', 'France'], dtype='<U6')
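
Since the question asks for the full list of labels and their codes, you can also build the whole mapping from the fitted encoder (a small sketch; int() just converts the NumPy integers for a cleaner printout):

>>> {label: int(code) for label, code in zip(enc.classes_, enc.transform(enc.classes_))}
{'China': 0, 'France': 1, 'UK': 2, 'US': 3}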

As mentioned by the answer here, if you want to do this between Python sessions, you could serialize the estimator to disk like this:

import pickle

with open('enc.pickle', 'wb') as file:
    pickle.dump(enc, file, pickle.HIGHEST_PROTOCOL)

Then load this in a new session and transform incoming data with it.
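
For instance, a minimal loading sketch (the filename matches the dump above):

import pickle

# restore the fitted encoder in the new session
with open('enc.pickle', 'rb') as file:
    enc = pickle.load(file)

# the restored encoder applies the same learned mapping
print(enc.transform(['France']))  # [1]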

Upvotes: 12

Learning is a mess

Reputation: 8277

In machine learning it is customary to keep the preprocessing pipeline around so that, after picking its hyperparameters and training the model, you can apply the same preprocessing to the test data.

If all of that runs in the same Python instance, as is common for small and mid-size projects, this means keeping your LabelEncoder alive, i.e. not letting it be garbage-collected. If training and testing run in different instances, I think the easiest solution is to store the encoder on disk and load it in the testing script.

I advise you to use pickle. Here is an example.
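
A minimal sketch of that two-script split (using joblib, which scikit-learn's documentation also suggests for persisting estimators; filenames are illustrative):

# train.py -- fit the encoder and persist it
from joblib import dump
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder().fit(['France', 'UK', 'US', 'China'])
dump(enc, 'label_encoder.joblib')

# test.py -- reload the encoder in a separate Python instance
from joblib import load

enc = load('label_encoder.joblib')
print(enc.transform(['France']))  # same code as during training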

Upvotes: 1
