Bluetail

Reputation: 1291

Using OrdinalEncoder() to transform columns with predefined numerical values

I have a dataframe like this:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'department': ['operations','operations','support','logics', 'sales'],
                   'salary': ["low", "medium", "medium", "high", "high"],
                   'tenure': [5,6,6,8,5],
                  })
df


   department  salary  tenure
0  operations     low       5
1  operations  medium       6
2     support  medium       6
3      logics    high       8
4       sales    high       5

I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I am not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.

However, the values are not ordered correctly after applying OrdinalEncoder(): where the salary is 'high' I get 0, while it should be 2.

oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df

    department  salary  tenure
0   operations  1.0     5
1   operations  2.0     6
2   support     2.0     6
3   logics      0.0     8
4   sales       0.0     5

I know that I can use df["salary"] = df["salary"].replace(0, 3), but I am hoping someone can suggest a more direct way to do it. Thank you.

Upvotes: 0

Views: 386

Answers (3)

Antoine Dubuis

Reputation: 5304

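If you want to perform this operation using OrdinalEncoder, you can use the categories parameter to specify the ordering. By default (categories='auto') the encoder sorts the unique values it finds, and alphabetically 'high' comes before 'low' and 'medium', which is why 'high' was encoded as 0.

Pass the desired order explicitly, as follows: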

OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])

Output:

array([[0.],
       [1.],
       [1.],
       [2.],
       [2.]])
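
To write the result back into the dataframe and confirm which order the encoder used, something like the sketch below should work. It assumes the dataframe from the question; the variable name encoded is only illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'department': ['operations', 'operations', 'support', 'logics', 'sales'],
                   'salary': ["low", "medium", "medium", "high", "high"],
                   'tenure': [5, 6, 6, 8, 5]})

# One inner list per encoded column; a label's position in the list becomes its code.
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded = oe.fit_transform(df[['salary']])  # 2-D float array, shape (5, 1)

# Flatten to 1-D before assigning it back as a single column.
df['salary'] = encoded.ravel()

print(oe.categories_)  # the order the encoder actually used
print(df)
```

Note that a label missing from the categories list makes fit raise an error by default, so the list has to cover every value present in the column.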

Upvotes: 2

user7864386

Reputation:

As @BENY says, you can stay in pandas and do what you want. factorize numbers the labels in order of first appearance, so it works if "low" appears first, "medium" second and "high" third in the data (as in your example). If that's not the case, factorize may not produce what you want.

A possible solution is to create a dictionary that maps salary levels to numbers and use map:

mapper = dict([['low', 1], ['medium', 2], ['high', 3]])
df['salary'] = df['salary'].map(mapper)

Output:

   department  salary  tenure
0  operations       1       5
1  operations       2       6
2     support       2       6
3      logics       3       8
4       sales       3       5
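
One thing to watch with map: any label that is not a key in the dictionary becomes NaN rather than raising an error. A minimal sketch of a check for that, using a made-up series with a deliberately mismatched "High" (the sample data and the name unmapped are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: "High" does not match the lowercase key "high".
salary = pd.Series(["low", "medium", "High", "high"])

mapper = {'low': 1, 'medium': 2, 'high': 3}
encoded = salary.map(mapper)

# Labels that had no entry in mapper show up as NaN after map.
unmapped = salary[encoded.isna()].unique()
print(encoded)
print(unmapped)
```

If unmapped is not empty, either extend the dictionary or normalise the labels (for example with salary.str.lower()) before mapping.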

Upvotes: 1

BENY

Reputation: 323266

You can just stay in pandas and use factorize:

df['new'] = df.salary.factorize()[0]
#Out[276]: array([0, 1, 1, 2, 2], dtype=int64)
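
Keep in mind that factorize assigns codes in order of first appearance, so the result depends on how the rows happen to be sorted. The sketch below uses a reordered, made-up series to show this, and contrasts it with an ordered pd.Categorical, which is a different technique from factorize but also stays in pandas and fixes the order explicitly.

```python
import pandas as pd

# Hypothetical data where "high" appears first.
salary = pd.Series(["high", "low", "medium", "medium", "high"])

# factorize: codes follow order of first appearance, not low < medium < high.
codes, uniques = salary.factorize()
print(codes.tolist())          # [0, 1, 2, 2, 0]

# Ordered categorical: codes follow the order given in categories.
ordered = pd.Categorical(salary, categories=['low', 'medium', 'high'], ordered=True)
print(ordered.codes.tolist())  # [2, 0, 1, 1, 2]
```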

Upvotes: 0
