Reputation: 451
I am trying to encode a number of columns containing categorical data ("Yes" and "No") in a large pandas dataframe. The complete dataframe contains over 400 columns, so I am looking for a way to encode all desired columns without having to encode them one by one. I use Scikit-learn's LabelEncoder to encode the categorical data.
The first part of the dataframe does not have to be encoded. However, I am looking for a method to encode all the desired columns containing categorical data directly, without splitting and concatenating the dataframe.
To demonstrate my question, I first tried to solve it on a small part of the dataframe. However, I get stuck at the last part, where the data is fitted and transformed, and get a ValueError: bad input shape (4, 3). The code as I ran it:
import pandas as pd
# Create a simple dataframe resembling the large dataframe
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})
# Import required module
from sklearn.preprocessing import LabelEncoder
# Create an object of the label encoder class
labelencoder = LabelEncoder()
# Apply labelencoder object on columns
labelencoder.fit_transform(data.ix[:, 1:]) # First column does not need to be encoded
Complete error report:
labelencoder.fit_transform(data.ix[:, 1:])
Traceback (most recent call last):
  File "<ipython-input-47-b4986a719976>", line 1, in <module>
    labelencoder.fit_transform(data.ix[:, 1:])
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\preprocessing\label.py", line 129, in fit_transform
    y = column_or_1d(y, warn=True)
  File "C:\Anaconda\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (4, 3)
Does anyone know how to do this?
Upvotes: 16
Views: 36154
Reputation: 81
Here is the simplest version I could write:
Step 1: Get all categorical columns:
categorical_columns = train.select_dtypes(['object']).columns
This stores the names of all categorical (object-dtype) columns.
Step 2: Write a for loop to do the transformation, since fit_transform only accepts one column (a 1-D input) at a time. The trick is to encode the columns one by one:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
for col in categorical_columns:
    train[col] = label_encoder.fit_transform(train[col])
Step3: Vote up lol :)
Hope you find this useful.
Upvotes: 1
Reputation: 666
If you know the names of the columns and don't want to use all of them, you can do something like this (this also gets rid of the for loop):
from sklearn.preprocessing import LabelEncoder
categ = ['Pclass','Cabin_Group','Ticket','Embarked']
# Encode Categorical Columns
le = LabelEncoder()
df[categ] = df[categ].apply(le.fit_transform)
Upvotes: 4
Reputation: 31
You can also loop through the different columns you want to apply the encoding to. This method might not be the most efficient, but it works fine.
from sklearn import preprocessing
LE = preprocessing.LabelEncoder()
for col in df.columns:
    # fit returns the encoder itself, so its result must not be assigned to the column
    LE.fit(df[col])
    df[col] = LE.transform(df[col])
    # reuse the mapping learned on df for the test data
    test_data[col] = LE.transform(test_data[col])
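A minimal sketch of the same idea on toy data, where the mapping is learned on the training frame only and then reused for the test frame. The frames `train` and `test` are made up for illustration, and the sketch assumes every label in the test data also occurs in the training data, since LabelEncoder raises a ValueError on labels it has not seen during fit.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for df and test_data
train = pd.DataFrame({'B': ["Yes", "No", "Yes", "Yes"],
                      'C': ["Yes", "No", "No", "Yes"]})
test = pd.DataFrame({'B': ["No", "Yes"],
                     'C': ["Yes", "Yes"]})

for col in train.columns:
    le = LabelEncoder()
    le.fit(train[col])                     # learn the mapping on the training data only
    train[col] = le.transform(train[col])
    test[col] = le.transform(test[col])    # reuse the same mapping for the test data

print(train)
print(test)
```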
Upvotes: 1
Reputation: 838
First, find all the features with type object:
objList = df.select_dtypes(include = "object").columns
print (objList)
Now, to convert the above objList features into numeric type, you can use a for loop as given below:
#Label Encoding for object to numeric conversion
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for feat in objList:
    df[feat] = le.fit_transform(df[feat].astype(str))
print (df.info())
Note that we explicitly cast each column to string in the for loop. Without the cast, a column that mixes strings with other types (for example numbers) can make the encoder throw an error.
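One case where the cast matters, as a hedged sketch: a column that mixes strings and numbers. The series `mixed` is invented for illustration, and other causes of the error may exist depending on the data and the scikit-learn version.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical object column that mixes strings with an integer
mixed = pd.Series(["Yes", "No", 1, "Yes"], dtype=object)

le = LabelEncoder()
try:
    le.fit_transform(mixed)
    raised = False
except TypeError:
    # str and int values cannot be sorted against each other
    raised = True

# Casting to str first makes every value comparable
codes = le.fit_transform(mixed.astype(str))
print(raised)
print(list(le.classes_), codes)
```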
Upvotes: 5
Reputation: 412
Scikit-learn has something for this now: OrdinalEncoder
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})
oe = OrdinalEncoder()
t_data = oe.fit_transform(data)
print(t_data)
# [[0. 1. 1. 0.]
#  [1. 0. 0. 1.]
#  [2. 1. 0. 0.]
#  [3. 1. 1. 1.]]
Works straight out of the box.
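Note that the call above also re-encodes the numeric column A. A minimal sketch, assuming only columns B to D should be encoded, passes just those columns to the encoder; the `cols` list and the `encoded` copy are illustrative names. The fitted OrdinalEncoder keeps the categories of every encoded column in `categories_`.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'A': [1, 2, 3, 4],
                     'B': ["Yes", "No", "Yes", "Yes"],
                     'C': ["Yes", "No", "No", "Yes"],
                     'D': ["No", "Yes", "No", "Yes"]})

cols = ['B', 'C', 'D']   # columns to encode; A is left untouched
oe = OrdinalEncoder()
encoded = data.copy()
# OrdinalEncoder returns floats, so cast the codes to int
encoded[cols] = oe.fit_transform(data[cols]).astype(int)

print(encoded)
print(oe.categories_)    # one array of categories per encoded column

# The original labels can be recovered from the codes
restored = oe.inverse_transform(encoded[cols])
```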
Upvotes: 3
Reputation: 11
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelBinarizer

# df is the pandas dataframe
class preprocessing(BaseEstimator, TransformerMixin):
    def __init__(self, df):
        self.datatypes = df.dtypes.astype(str)
        self.catcolumns = []
        self.cat_encoders = []

    def fit(self, df, y=None):
        # Collect the names of all object (categorical) columns
        for ix, val in zip(self.datatypes.index.values,
                           self.datatypes.values):
            if val == 'object':
                self.catcolumns.append(ix)
        # Fit one LabelBinarizer per categorical column
        for name in self.catcolumns:
            encs = LabelBinarizer()
            encs.fit(df[name])
            self.cat_encoders.append((name, encs))
        return self

    def transform(self, df, y=None):
        # Build the result in local variables so transform can be called repeatedly
        encoded = []
        for name, encs in self.cat_encoders:
            df_c = encs.transform(df[name])
            encoded.append(pd.DataFrame(df_c))
        encoded_df = pd.concat(encoded, axis=1, ignore_index=True)
        df_num = df.drop(self.catcolumns, axis=1)
        y = pd.concat([df_num, encoded_df], axis=1, ignore_index=True)
        # use return y.values to use in a scikit-learn pipeline
        return y

""" Finds categorical columns in a dataframe and one-hot encodes the
columns. You can replace LabelBinarizer with LabelEncoder if you
require only label encoding. The transform method returns the encoded
categorical data and the numerical data as a dataframe. """
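For comparison, scikit-learn's own ColumnTransformer can play a similar role inside a pipeline. This is a swapped-in alternative rather than the class above: a sketch that assumes the categorical column names are known in advance (`cat_cols` is an illustrative name) and that uses OrdinalEncoder instead of LabelBinarizer, so each column becomes integer codes and not one-hot columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"]})

cat_cols = ['B', 'C']  # assumed to be known in advance
ct = ColumnTransformer(
    transformers=[('cat', OrdinalEncoder(), cat_cols)],
    remainder='passthrough',  # keep the remaining (numeric) columns unchanged
)

# Encoded categorical columns come first, passthrough columns after them
X = ct.fit_transform(df)
print(X)
```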
Upvotes: 1
Reputation: 8823
As the following code shows, you can encode multiple columns by applying LabelEncoder to the DataFrame. However, please note that the encoder is refitted for each column, so afterwards it only holds the classes of the last column.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})
print(df)
#    A    B    C    D
# 0  1  Yes  Yes   No
# 1  2   No   No  Yes
# 2  3  Yes   No   No
# 3  4  Yes  Yes  Yes
# LabelEncoder
le = LabelEncoder()
# apply "le.fit_transform" to every column
df_encoded = df.apply(le.fit_transform)
print(df_encoded)
#    A  B  C  D
# 0  0  1  1  0
# 1  1  0  0  1
# 2  2  1  0  0
# 3  3  1  1  1
# Note: le was refitted on each column, so only the classes of the last column (D) remain.
print(le.classes_)
# ['No' 'Yes']
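If the classes of every column are needed, one option is to keep a separate encoder per column. A minimal sketch; the `encoders` dict and the choice of columns B to D are illustrative assumptions:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': ["Yes", "No", "Yes", "Yes"],
                   'C': ["Yes", "No", "No", "Yes"],
                   'D': ["No", "Yes", "No", "Yes"]})

encoders = {}              # column name -> fitted LabelEncoder
df_encoded = df.copy()
for col in ['B', 'C', 'D']:
    encoders[col] = LabelEncoder()
    df_encoded[col] = encoders[col].fit_transform(df[col])

# The classes of each column remain available
for col, enc in encoders.items():
    print(col, enc.classes_)

# The codes can also be mapped back to the original labels
original_b = encoders['B'].inverse_transform(df_encoded['B'])
```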
Upvotes: 22