Reputation: 1291
I have a dataframe like this:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({'department': ['operations','operations','support','logics', 'sales'],
'salary': ["low", "medium", "medium", "high", "high"],
'tenure': [5,6,6,8,5],
})
df
department salary tenure
0 operations low 5
1 operations medium 6
2 support medium 6
3 logics high 8
4 sales high 5
I want to encode the salary feature as ['low', 1], ['medium', 2], ['high', 3], or alternatively as ['low', 0], ['medium', 1], ['high', 2]. I am not sure whether the exact values make a difference for later use in a classification algorithm such as logistic regression in scikit-learn.
However, the values are not ordered correctly after I apply OrdinalEncoder(): where the salary is 'high' I get 0, while it should be 2.
oe = OrdinalEncoder()
df[["salary"]] = oe.fit_transform(df[["salary"]])
df
department salary tenure
0 operations 1.0 5
1 operations 2.0 6
2 support 2.0 6
3 logics 0.0 8
4 sales 0.0 5
I know that I can use df["salary"] = df["salary"].replace(0,3), but I'm hoping someone can suggest a more direct way to do it. Thank you.
Upvotes: 0
Views: 386
Reputation: 5304
If you want to perform this operation using OrdinalEncoder, you can use the categories parameter to specify the ordering, as follows:
OrdinalEncoder(categories=[['low', 'medium', 'high']]).fit_transform(df[['salary']])
Output:
array([[0.],
[1.],
[1.],
[2.],
[2.]])
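For comparison, here is a minimal sketch of the default behaviour next to the explicit ordering. It assumes only the salary column from the question. The names default_oe, ordered_oe and salary_code are made up for illustration. With the default categories='auto', the encoder sorts the labels, which is why 'high' ends up as 0.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Only the salary column from the question is needed here.
df = pd.DataFrame({"salary": ["low", "medium", "medium", "high", "high"]})

# Default: categories are inferred from the data and sorted,
# so the order is 'high' < 'low' < 'medium'.
default_oe = OrdinalEncoder()
default_codes = default_oe.fit_transform(df[["salary"]])
print(default_oe.categories_)

# Explicit order: one list of labels per encoded column.
ordered_oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = ordered_oe.fit_transform(df[["salary"]])

# fit_transform returns a 2-D float array; flatten it and cast to int
# before storing it in a new (hypothetical) column.
df["salary_code"] = ordered_codes.ravel().astype(int)
print(df)
```

The fitted encoder keeps the ordering in categories_, so the same mapping can be reused on new data with transform.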
Upvotes: 2
Reputation:
As @BENY says, you can stay in pandas and do what you want. factorize is great if "low" appears first, "medium" second and "high" third in the data (as shown in your example). If that's not the case, factorize numbers the labels in order of first appearance and may not produce the ordering you want.
A possible solution is to create a dictionary that maps salary levels to numbers and use map:
mapper = {'low': 1, 'medium': 2, 'high': 3}
df['salary'] = df['salary'].map(mapper)
Output:
department salary tenure
0 operations 1 5
1 operations 2 6
2 support 2 6
3 logics 3 8
4 sales 3 5
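To illustrate the caveat about factorize, here is a small sketch on made-up data in which "high" happens to appear first. The shuffled series is an assumption for demonstration, not part of the question's dataframe.

```python
import pandas as pd

# Hypothetical data where the labels do not appear in low/medium/high order.
shuffled = pd.Series(["high", "low", "medium", "low", "high"], name="salary")

# factorize assigns codes by order of first appearance:
# 'high' -> 0, 'low' -> 1, 'medium' -> 2.
factorized_codes = shuffled.factorize()[0]
print(factorized_codes)

# An explicit dictionary gives the same codes regardless of row order.
mapper = {"low": 1, "medium": 2, "high": 3}
mapped_codes = shuffled.map(mapper)
print(mapped_codes.tolist())
```

Note that map returns NaN for any label missing from the dictionary, so it is worth checking for unexpected values before relying on the result.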
Upvotes: 1
Reputation: 323266
You can just stay with pandas and use factorize:
df['new'] = df.salary.factorize()[0]
#Out[276]: array([0, 1, 1, 2, 2], dtype=int64)
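factorize returns two things, the codes and the unique labels in order of first appearance, so the label-to-code mapping can be checked rather than assumed. A minimal sketch using the question's salary values (the name label_to_code is illustrative):

```python
import pandas as pd

salary = pd.Series(["low", "medium", "medium", "high", "high"])

# codes is an integer array; uniques holds the labels in first-seen order.
codes, uniques = salary.factorize()

# Build a lookup to confirm which label received which code.
label_to_code = {label: code for code, label in enumerate(uniques)}
print(label_to_code)
```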
Upvotes: 0