python pandas machine-learning encoding scikit-learn

Reputation: 578

How to encode multiple features at once with SciKit Learn transform

I am trying to encode some categorical features so that I can use them in a machine learning model. At the moment I have the following code:

import pandas as pd
from sklearn import preprocessing

data_path = '/Users/novikov/Assignment2/epl-training.csv'
data = pd.read_csv(data_path)
data['Date'] = pd.to_datetime(data['Date'])

le = preprocessing.LabelEncoder()


data['HomeTeam'] = le.fit_transform(data.HomeTeam.values)
data['AwayTeam'] = le.fit_transform(data.AwayTeam.values)
data['FTR'] = le.fit_transform(data.FTR.values)
data['HTR'] = le.fit_transform(data.HTR.values)
data['Referee'] = le.fit_transform(data.Referee.values)

This works fine, but it is not ideal: if there were 100 features to encode, it would take far too long to do by hand. How do I automate the process? I have tried implementing a loop:

label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(method)

But I get ValueError: bad input shape ():

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-1b8fb6164d2d> in <module>()
     11     method = 'data.' + feature + '.values'
     12     print(method)
---> 13     data[feature] = le.fit_transform(method)

/anaconda3/lib/python3.6/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
    109         y : array-like of shape [n_samples]
    110         """
--> 111         y = column_or_1d(y, warn=True)
    112         self.classes_, y = np.unique(y, return_inverse=True)
    113         return y

/anaconda3/lib/python3.6/site-packages/sklearn/utils/validation.py in column_or_1d(y, warn)
    612         return np.ravel(y)
    613 
--> 614     raise ValueError("bad input shape {0}".format(shape))
    615 
    616 

ValueError: bad input shape ()

None of the variations of this code (like just putting data.feature.values) seem to work. There must be a way of doing it other than writing it by hand.

Upvotes: 3

Answers (4)

piRSquared

Reputation: 294536

The way the encoder object works is that when you fit it, it stores some metadata in the object's attributes. These attributes are used when you transform the data. fit_transform is a convenience method that fits and transforms in one step.

When you use the same object for another fit_transform, you overwrite the stored metadata. That is fine if you don't want to use the object's inverse_transform.
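To make the overwriting concrete, here is a minimal sketch. The sample lists are made up for illustration and are not the asker's data. After the second fit_transform, the encoder's classes_ attribute holds only the second set of labels, so inverse_transform decodes the first set of codes with the wrong labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, for illustration only.
teams = ['Arsenal', 'Chelsea', 'Arsenal']
results = ['H', 'D', 'A']

le = LabelEncoder()

team_codes = le.fit_transform(teams)
classes_after_teams = list(le.classes_)    # labels learned from `teams`

result_codes = le.fit_transform(results)
classes_after_results = list(le.classes_)  # labels learned from `results`; the team labels are gone

# Decoding the team codes now uses the labels from `results`, not from `teams`.
decoded = list(le.inverse_transform(team_codes))
```

If you need to decode a column later, keep one encoder per column instead of reusing a single one.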

Setup

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({
    'HomeTeam':[1, 3, 27],
    'AwayTeam':[9, 8, 100],
    'FTR':['dog', 'cat', 'dog'],
    'HTR': [*'XYY'],
    'Referee': [*'JJB']
})

Answer to your question

update and apply

le = preprocessing.LabelEncoder()
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

df.update(df[label_encode].apply(le.fit_transform))
df

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0

How I'd Do It

Each separate encoder is captured in the le dictionary for potential later use:

from collections import defaultdict
le = defaultdict(preprocessing.LabelEncoder)
label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

df = df.assign(**{k: le[k].fit_transform(df[k]) for k in label_encode})
df

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0
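As a rough illustration of that "later use", the sketch below decodes each column with its own stored encoder. The two-column frame and the names encoders, encoded and decoded are assumptions chosen for the example, not part of the answer above.

```python
from collections import defaultdict

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small hypothetical frame, for illustration only.
df = pd.DataFrame({
    'FTR': ['dog', 'cat', 'dog'],
    'Referee': ['J', 'J', 'B'],
})
label_encode = ['FTR', 'Referee']

# One encoder per column, created on first access to encoders[column].
encoders = defaultdict(LabelEncoder)
encoded = df.assign(**{k: encoders[k].fit_transform(df[k]) for k in label_encode})

# Each column can be decoded with the encoder that was fitted on it.
decoded = encoded.assign(
    **{k: encoders[k].inverse_transform(encoded[k]) for k in label_encode}
)
```

Because every column has its own fitted encoder, nothing is overwritten, and the same encoders can also be used to transform new data that contains the same labels.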

`pandas.factorize`

If you just want codes, you can use pandas' factorize. Note that it does not sort the values; it labels them in the order in which they first appear.

df.update(df[label_encode].apply(lambda x: x.factorize()[0]))
df

   AwayTeam FTR HTR  HomeTeam Referee
0         0   0   0         0       0
1         1   1   1         1       0
2         2   0   1         2       1
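For a single column, the difference in ordering can be sketched as follows. The sort=True argument is an addition for comparison and is not used in the answer above; the Series is a made-up sample.

```python
import pandas as pd

ftr = pd.Series(['dog', 'cat', 'dog'])

# Default: codes follow the order of first appearance ('dog' -> 0, 'cat' -> 1).
codes, uniques = pd.factorize(ftr)

# sort=True: labels are sorted first ('cat' -> 0, 'dog' -> 1).
sorted_codes, sorted_uniques = pd.factorize(ftr, sort=True)

# The labels returned alongside the codes map the codes back to the originals.
restored = uniques.take(codes)
```

factorize returns the codes together with the unique labels, so the mapping is not lost as long as you keep the second return value.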

Numpy's `unique`

This does sort the values, so the result looks like LabelEncoder's output.

df.update(df[label_encode].apply(lambda x: np.unique(x, return_inverse=True)[1]))

   AwayTeam FTR HTR  HomeTeam Referee
0         1   1   0         0       1
1         0   0   1         1       1
2         2   1   1         2       0
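The traceback in the question shows LabelEncoder.fit_transform itself calling np.unique(y, return_inverse=True) in that scikit-learn version, which is why the two results agree. A minimal single-column sketch, using a made-up array:

```python
import numpy as np

ftr = np.array(['dog', 'cat', 'dog'])

# classes: sorted unique labels; codes: position of each original value in classes.
classes, codes = np.unique(ftr, return_inverse=True)

# Indexing the classes with the codes reconstructs the input.
restored = classes[codes]
```

Keeping classes around gives you the equivalent of the encoder's classes_ attribute.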

Upvotes: 4

BENY

Reputation: 323386

I am just fixing your code by adding pd.eval:

label_encode = ['HomeTeam', 'AwayTeam', 'FTR', 'HTR', 'Referee']

for feature in label_encode:
    method = 'data.' + feature + '.values'
    data[feature] = le.fit_transform(pd.eval(method))

Upvotes: 2

Dylan

Reputation: 428

It's a little awkward, but you can access the values of each series and call fit_transform on them. Assigning to X[c] inside the for loop writes the encoded values back to the DataFrame.

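import pandas as pd
from sklearn.preprocessing import LabelEncoder

X = pd.DataFrame({
    'A':[1, 3, 27],
    'B':[9, 8, 100],
    'C':['dog', 'cat', 'dog']})
print(X.head())

le = LabelEncoder()

# Encode every column in turn and write the codes back into the DataFrame.
for c in X.columns:
    X[c] = le.fit_transform(X[c].values)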

X.head()

Upvotes: 0

desertnaut

Reputation: 60400

Of course, method = 'data.' + feature + '.values' will not work - it is a string itself, not the column it names! Look the column up by name instead, with data[feature].values:

for feature in label_encode:
    data[feature] = le.fit_transform(data[feature].values)
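The empty shape in the error message can be reproduced without scikit-learn. The sketch below assumes that the validation step derives the shape in the same way np.shape does; the list of team names is a made-up sample.

```python
import numpy as np

feature = 'HomeTeam'
method = 'data.' + feature + '.values'

# Concatenation only builds text; nothing is looked up on any DataFrame.
is_text = isinstance(method, str)

# A lone string is a scalar to NumPy, so its shape is the empty tuple.
scalar_shape = np.shape(method)

# A real column of values is one-dimensional.
column_shape = np.shape(['Arsenal', 'Chelsea', 'Arsenal'])
```

So the encoder received a single piece of text, not a 1-D array of labels, which is consistent with "bad input shape ()".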