Reputation: 757
I have a dataset that requires label encoding, and I am using sklearn's LabelEncoder for it.
Here is the reproducible code for the problem:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
data11 = pd.DataFrame({'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
'Complaint_reason': ['Incorrect Info', 'False Statement', 'Using a Debit Card', 'Payoff process'],
'Company_response': ['Response1', 'Response2', 'Response3', 'Response1'],
'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
'Complaint_Status': ['Processing','Closed', 'Awaiting Response', 'Closed']
})
le = LabelEncoder()
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Complaint_reason'] = le.transform(data11['Complaint_reason'])
data11['Company_response'] = le.fit_transform(data11['Company_response'])
data11['Consumer_disputes'] = le.transform(data11['Consumer_disputes'])
data11['Complaint_Status'] = le.transform(data11['Complaint_Status'])
The desired output should be something like:
({'Transaction_Type': ['1', '2', '3', '1'],
'Complaint_reason': ['1', '2', '3', '4'],
'Company_response': ['1', '2', '3', '1'],
'Consumer_disputes': ['1', '2', '2', '1'],
'Complaint_Status': ['1','2', '3', '2']
})
The problem is that when I try to encode the columns, 'Transaction_Type' and 'Company_response' are encoded successfully, but 'Complaint_reason', 'Consumer_disputes' and 'Complaint_Status' throw errors.
For 'Complaint_reason':
File "C:/Users/Ashu/untitled0.py", line 26, in <module>
data11['Complaint_reason'] = le.transform(data11['Complaint_reason'])
ValueError: y contains new labels: ['APR or interest rate' 'Account opening, closing, or management'
'Account terms and changes' ...
"Was approved for a loan, but didn't receive the money"
'Written notification about debt' 'Wrong amount charged or received']
and similarly for 'Consumer_disputes':
File "<ipython-input-117-9625bd78b740>", line 1, in <module>
data11['Consumer_disputes'] = le.transform(data11['Consumer_disputes'].astype(str))
ValueError: y contains new labels: ['No' 'Yes']
and similarly for 'Complaint_Status':
File "<ipython-input-119-5cd289c72e45>", line 1, in <module>
data11['Complaint_Status'] = le.transform(data11['Complaint_Status'])
ValueError: y contains new labels: ['Closed' 'Closed with explanation' 'Closed with monetary relief'
'Closed with non-monetary relief' 'Untimely response']
These are all categorical variables with a fixed set of values in the form of short phrases. A slice of the data:
[image: Categorical Data Label Encoding]
There are a couple of questions on this on SO but none have been answered successfully.
Upvotes: 2
Views: 3970
Reputation: 1101
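You are calling transform() on columns the encoder was never fitted on, and that is why you are getting the error: transform() only knows the labels seen in the most recent fit() or fit_transform() call. Use fit_transform() for every column.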
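A minimal sketch of the failure, using two columns from the question (the variable names here are only illustrative). After fitting on one column, the encoder knows only that column's labels, so transform() on a different column raises ValueError:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

transaction_type = pd.Series(['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'])
consumer_disputes = pd.Series(['Yes', 'No', 'No', 'Yes'])

le = LabelEncoder()
le.fit_transform(transaction_type)

# The encoder now knows only the labels it was fitted on.
known_labels = [str(label) for label in le.classes_]

# transform() on a column with other labels fails, because
# 'Yes' and 'No' were never seen during fit.
try:
    le.transform(consumer_disputes)
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
```

The exact wording of the error message differs between scikit-learn versions, but the exception type is ValueError in each case.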
sklearn.preprocessing.LabelEncoder -> Encode labels with value between 0 and n_classes-1 (from official docs)
If you still want your classes encoded between 1 and n_classes, you just need to add 1 to the result.
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Transaction_Type']
Output:
0 2
1 1
2 0
3 2
Name: Transaction_Type, dtype: int64
Notice that LabelEncoder() assigns codes in alphabetical (sorted) order: it gives the label 0 to Consumer Loan, which comes first alphabetically, and the label 2 to Mortgage, which comes last.
Now you have two ways to encode it. Either accept the default output of LabelEncoder, like this:
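To check which code belongs to which label, the fitted encoder's classes_ attribute can be inspected. The following is a small sketch; the names mapping and restored are only illustrative:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage']

le = LabelEncoder()
codes = le.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform() maps the codes back to the original labels.
restored = [str(label) for label in le.inverse_transform(codes)]
```

Here mapping is {'Consumer Loan': 0, 'Credit reporting': 1, 'Mortgage': 2}, and restored equals the original list.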
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type'])
data11['Complaint_reason'] = le.fit_transform(data11['Complaint_reason'])
data11['Company_response'] = le.fit_transform(data11['Company_response'])
data11['Consumer_disputes'] = le.fit_transform(data11['Consumer_disputes'])
data11['Complaint_Status'] = le.fit_transform(data11['Complaint_Status'])
Output:
Transaction_Type Complaint_reason Company_response Consumer_disputes Complaint_Status
0 2 1 0 1 2
1 1 0 1 0 1
2 0 3 2 0 0
3 2 2 0 1 1
Or add 1 to each result to get codes between 1 and n_classes:
# Do not sort the column before encoding: fit_transform() returns a plain
# array, so codes for sorted values would be assigned back to the wrong rows.
data11['Transaction_Type'] = le.fit_transform(data11['Transaction_Type']) + 1
data11['Complaint_reason'] = le.fit_transform(data11['Complaint_reason']) + 1
data11['Company_response'] = le.fit_transform(data11['Company_response']) + 1
data11['Consumer_disputes'] = le.fit_transform(data11['Consumer_disputes']) + 1
data11['Complaint_Status'] = le.fit_transform(data11['Complaint_Status']) + 1
Output:
Transaction_Type Complaint_reason Company_response Consumer_disputes Complaint_Status
0 3 2 1 2 3
1 2 1 2 1 2
2 1 4 3 1 1
3 3 3 1 2 2
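If the goal is the numbering shown in the question (1 for the first label that appears, 2 for the next new one, and so on), one option is pandas' factorize() rather than LabelEncoder, since LabelEncoder numbers labels in sorted order. This is a sketch of a swapped-in technique, not part of the approach above; the name encoded is only illustrative:

```python
import pandas as pd

data11 = pd.DataFrame({
    'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
    'Complaint_reason': ['Incorrect Info', 'False Statement', 'Using a Debit Card', 'Payoff process'],
    'Company_response': ['Response1', 'Response2', 'Response3', 'Response1'],
    'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
    'Complaint_Status': ['Processing', 'Closed', 'Awaiting Response', 'Closed'],
})

encoded = data11.copy()
for col in encoded.columns:
    # factorize() numbers labels in order of first appearance, starting at 0;
    # adding 1 shifts the codes to start at 1.
    codes, uniques = pd.factorize(encoded[col])
    encoded[col] = codes + 1
```

Unlike a fitted LabelEncoder, factorize() does not keep a reusable mapping on an object, so the returned uniques would have to be stored separately if new data must be encoded the same way later.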
Upvotes: 1
Reputation: 21709
Since each column has a different set of labels, I think you need to initialise a new le for each column:
for col in data11.columns:
    le = LabelEncoder()
    data11[col] = le.fit_transform(data11[col])
Transaction_Type Complaint_reason Company_response Consumer_disputes \
0 2 1 0 1
1 1 0 1 0
2 0 3 2 0
3 2 2 0 1
Complaint_Status
0 2
1 1
2 0
3 1
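A possible variation, in case the codes need to be decoded later or the same mapping applied to new data: keep each column's fitted encoder in a dict instead of overwriting le on every iteration. This is a sketch on two of the question's columns; the names encoders, encoded and decoded are only illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data11 = pd.DataFrame({
    'Transaction_Type': ['Mortgage', 'Credit reporting', 'Consumer Loan', 'Mortgage'],
    'Consumer_disputes': ['Yes', 'No', 'No', 'Yes'],
})

encoders = {}           # column name -> fitted LabelEncoder
encoded = data11.copy()
for col in data11.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(data11[col])

# Each stored encoder can map its own column's codes back to labels.
decoded = [str(label) for label in
           encoders['Consumer_disputes'].inverse_transform(encoded['Consumer_disputes'])]
```

The stored encoders can also be used with transform() on new rows, provided those rows contain only labels seen during fitting.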
Upvotes: 0