Reputation: 578
I am trying to encode some categorical features so that I can use them in a machine learning model. At the moment I have the following code:
import pandas as pd
from sklearn import preprocessing

data_path = '/Users/novikov/Assignment2/epl-training.csv'
data = pd.read_csv(data_path)
data['Date'] = pd.to_datetime(data['Date'])
le = preprocessing.LabelEncoder()
data['HomeTeam'] = le.fit_transform(data.HomeTeam.values)
data['AwayTeam'] = le.fit_transform(data.AwayTeam.values)
data['FTR'] = le.fit_transform(data.FTR.values)
data['HTR'] = le.fit_transform(data.HTR.values)
data['Referee'] = le.fit_transform(data.Referee.values)
This works, but it is not ideal: with 100 features to encode, doing it by hand would take far too long. How do I automate the process? I have tried implementing a loop:
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(method)
But I get ValueError: bad input shape ():
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-11-1b8fb6164d2d> in <module>()
11 method = 'data.' + feature + '.values'
12 print(method)
---> 13 data[feature] = le.fit_transform(method)
/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
109 y : array-like of shape [n_samples]
110 """
--> 111 y = column_or_1d(y, warn=True)
112 self.classes_, y = np.unique(y, return_inverse=True)
113 return y
/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
612 return np.ravel(y)
613
--> 614 raise ValueError("bad input shape {0}".format(shape))
615
616
ValueError: bad input shape ()
None of the variations of this code (like just putting data.feature.values) seem to work. There must be a way of doing it other than writing it by hand.
Upvotes: 3
Views: 1633
Reputation: 294536
When you fit the encoder object, it stores some metadata in the object's attributes. These attributes are used when you later transform the data. fit_transform is a convenience method that does fit and transform in one step.
When you use the same object for another fit_transform, you overwrite the stored metadata. That is fine if you don't need the object's inverse_transform.
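To make the overwriting concrete, here is a minimal sketch (the two small lists are made up for illustration) suggesting that a reused LabelEncoder only remembers the classes from its most recent fit:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# First fit: the encoder learns the classes of the first column.
ftr_codes = le.fit_transform(['dog', 'cat', 'dog'])
classes_after_first_fit = le.classes_.tolist()

# Second fit on the same object: the stored classes are replaced.
htr_codes = le.fit_transform(['X', 'Y', 'Y'])
classes_after_second_fit = le.classes_.tolist()

# inverse_transform now decodes with the classes from the last fit,
# so the codes of the first column no longer map back to 'dog'/'cat'.
decoded_ftr = le.inverse_transform(ftr_codes).tolist()

print(classes_after_first_fit)
print(classes_after_second_fit)
print(decoded_ftr)
```

The codes themselves are still valid features for a model; only the ability to decode the earlier column is lost.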
df = pd.DataFrame({
    'HomeTeam': [1, 3, 27],
    'AwayTeam': [9, 8, 100],
    'FTR': ['dog', 'cat', 'dog'],
    'HTR': [*'XYY'],
    'Referee': [*'JJB']
})
update and apply
le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df.update(df[label_encode].apply(le.fit_transform))
df
   AwayTeam  FTR  HTR  HomeTeam  Referee
0         1    1    0         0        1
1         0    0    1         1        1
2         2    1    1         2        0
Each separate encoder is captured in the le dictionary for potential later use:
from collections import defaultdict
le = defaultdict(preprocessing.LabelEncoder)
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
df = df.assign(**{k: le[k].fit_transform(df[k]) for k in label_encode})
df
   AwayTeam  FTR  HTR  HomeTeam  Referee
0         1    1    0         0        1
1         0    0    1         1        1
2         2    1    1         2        0
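As a sketch of what "later use" might look like, the per-column encoders can decode each column with its own stored classes. The name `encoders` and the two-column frame below are illustrative assumptions, not part of the code above:

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'FTR': ['dog', 'cat', 'dog'],
    'HTR': ['X', 'Y', 'Y'],
})

# One LabelEncoder per column, created on first access to a key.
encoders = defaultdict(LabelEncoder)
encoded = df.assign(**{k: encoders[k].fit_transform(df[k]) for k in df.columns})

# Decode every column with the encoder that was fitted on it.
decoded = encoded.assign(
    **{k: encoders[k].inverse_transform(encoded[k]) for k in encoded.columns}
)

# Compare values column by column to confirm the round trip.
round_trip_ok = all(decoded[k].tolist() == df[k].tolist() for k in df.columns)
print(round_trip_ok)
```

Keeping one encoder per column costs a little memory but preserves the mapping needed to turn predictions or codes back into labels.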
pandas.factorize
If you just want codes, you can use Pandas' factorize. Note that this does not sort the values; it labels them in the order they first appear.
df.update(df[label_encode].apply(lambda x: x.factorize()[0]))
df
   AwayTeam  FTR  HTR  HomeTeam  Referee
0         0    0    0         0        0
1         1    1    1         1        0
2         2    0    1         2        1
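The ordering difference can be seen on a single column. This is a small sketch with a made-up Series; it also uses the `sort` parameter of factorize, which orders the unique values before assigning codes:

```python
import pandas as pd

s = pd.Series(['dog', 'cat', 'dog'])

# Default: codes follow the order of first appearance,
# so 'dog' gets 0 and 'cat' gets 1.
codes, uniques = s.factorize()

# sort=True: the unique values are sorted before codes are assigned,
# so 'cat' gets 0 and 'dog' gets 1.
sorted_codes, sorted_uniques = s.factorize(sort=True)

print(codes.tolist(), list(uniques))
print(sorted_codes.tolist(), list(sorted_uniques))
```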
unique
This does sort the final values, so the result looks like the output of LabelEncoder:
df.update(df[label_encode].apply(lambda x: np.unique(x, return_inverse=True)[1]))
   AwayTeam  FTR  HTR  HomeTeam  Referee
0         1    1    0         0        1
1         0    0    1         1        1
2         2    1    1         2        0
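For a single column, the two return values of np.unique can be read as the sorted classes and the integer codes. A minimal sketch with a made-up array:

```python
import numpy as np

values = np.array(['dog', 'cat', 'dog'])

# np.unique sorts the distinct values; return_inverse additionally gives,
# for each element, the index of its value in that sorted array.
classes, codes = np.unique(values, return_inverse=True)

# Indexing the sorted classes with the codes reconstructs the input.
restored = classes[codes]

print(classes.tolist(), codes.tolist(), restored.tolist())
```

Keeping `classes` around plays the same role as keeping a fitted encoder: it is the lookup table for decoding.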
Upvotes: 4
Reputation: 323386
I am just fixing your code by adding pd.eval:
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']
for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(pd.eval(method))
Upvotes: 2
Reputation: 428
It's a little awkward, but you can access the values of each series and call fit_transform on them. Selecting the series inside the for loop with "X[c] =" assigns the encoded values back to the DataFrame.
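import pandas as pd
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame({
    'A': [1, 3, 27],
    'B': [9, 8, 100],
    'C': ['dog', 'cat', 'dog']})
print(X.head())

le = LabelEncoder()
# Re-fit the encoder on each column and write the codes back into X.
for c in X.columns:
    X[c] = le.fit_transform(X[c].values)
X.head()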
Upvotes: 0
Reputation: 60400
Of course, method = 'data.' + feature + '.values' will not work: it is a string itself! Try instead
method = data[feature].values
or
for feature in label_encode:
    data[feature] = le.fit_transform(data[feature].values)
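A small sketch of why the string version produces "bad input shape ()": a plain string is seen as a single zero-dimensional value, while indexing the DataFrame by name returns a one-dimensional column. The team names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data.
data = pd.DataFrame({'HomeTeam': ['Arsenal', 'Chelsea', 'Arsenal']})
feature = 'HomeTeam'

# Concatenating strings yields a plain str, not the column it names.
as_text = 'data.' + feature + '.values'
text_shape = np.asarray(as_text).shape      # zero-dimensional

# Indexing the DataFrame with the name yields the actual column values.
as_column = data[feature].values
column_shape = np.asarray(as_column).shape  # one entry per row

# getattr is the attribute-style equivalent of writing data.HomeTeam.
same_column = getattr(data, feature).values

print(text_shape, column_shape)
```

Bracket indexing is usually the safer choice, since it also works for column names that are not valid Python identifiers.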
Upvotes: 4