Reputation: 121
If I have two columns as below:
Origin Destination
China USA
China Turkey
USA China
USA Turkey
USA Russia
Russia China
How would I perform label encoding while ensuring that the labels in the Origin column match those in the Destination column, i.e.:
Origin Destination
0 1
0 3
1 0
1 3
1 2
2 0
If I encode each column separately, the algorithm will treat China in the first column as different from China in the second column, which is not the case.
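To make the problem concrete, here is a minimal sketch (assuming pandas and a frame named df built from the sample data above) showing that factorizing each column on its own gives the same country two different codes:

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# Encode each column independently (the approach that causes the mismatch).
separate = df.apply(lambda col: pd.factorize(col)[0])

# Look up the code that "USA" received in each column.
usa_origin = separate.loc[df["Origin"] == "USA", "Origin"].iloc[0]
usa_destination = separate.loc[df["Destination"] == "USA", "Destination"].iloc[0]
print(usa_origin, usa_destination)  # 1 0
```

USA is the second distinct value in Origin but the first in Destination, so the two columns disagree.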
Upvotes: 11
Views: 8399
Reputation: 294218
stack
df.stack().pipe(lambda s: pd.Series(pd.factorize(s.values)[0], s.index)).unstack()
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
factorize with reshape
pd.DataFrame(
    pd.factorize(df.values.ravel())[0].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
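pd.factorize also returns the distinct values it found, which is handy if you need to decode later. A small sketch, assuming the same sample data; the names codes, uniques, encoded and decoded are just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

# factorize returns (codes, uniques); codes follow order of first appearance.
codes, uniques = pd.factorize(df.to_numpy().ravel())
encoded = pd.DataFrame(codes.reshape(df.shape), index=df.index, columns=df.columns)

# uniques[i] is the label that received code i, so indexing with codes decodes.
decoded = pd.DataFrame(uniques[codes].reshape(df.shape),
                       index=df.index, columns=df.columns)
print(list(uniques))  # ['China', 'USA', 'Turkey', 'Russia']
```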
np.unique and reshape
pd.DataFrame(
    np.unique(df.values.ravel(), return_inverse=True)[1].reshape(df.shape),
    df.index, df.columns
)
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
I couldn't stop trying stuff... sorry!
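import itertools

# y (the value -> code dict) and c (the counter) are default arguments,
# so they keep their state from one element to the next
df.applymap(
    lambda x, y={}, c=itertools.count():
        y.get(x) if x in y else y.setdefault(x, next(c))
)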
Origin Destination
0 0 1
1 0 3
2 1 0
3 1 3
4 1 2
5 2 0
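The lambda above keeps its state in default arguments, so that state is created once and persists for every later call of the same lambda. A sketch of the same idea with the state made explicit; encode_in_visit_order is a hypothetical helper name, and the fallback assumes DataFrame.map (pandas 2.1+) where available and applymap otherwise:

```python
import itertools
import pandas as pd

def encode_in_visit_order(frame):
    """Hypothetical helper: give each distinct value the next unused integer."""
    codes = {}
    counter = itertools.count()

    def lookup(value):
        # Assign a new code only the first time a value is seen.
        if value not in codes:
            codes[value] = next(counter)
        return codes[value]

    # DataFrame.map replaced applymap in pandas 2.1; fall back on older versions.
    elementwise = frame.map if hasattr(pd.DataFrame, "map") else frame.applymap
    return elementwise(lookup), codes

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})
encoded, codes = encode_in_visit_order(df)
print(encoded)
```

Returning the dict alongside the frame also gives you the mapping for decoding.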
As pointed out by cᴏʟᴅsᴘᴇᴇᴅ, you can shorten this by assigning back to the dataframe:
df[:] = pd.factorize(df.values.ravel())[0].reshape(df.shape)
Upvotes: 8
Reputation: 11602
Edit: just found out about the return_inverse option to np.unique. No need to search and substitute!
df.values[:] = np.unique(df, return_inverse=True)[1].reshape(-1,2)
You could leverage the vectorized version of np.searchsorted with:
df.values[:] = np.searchsorted(np.sort(np.unique(df)), df)
Or you could create an array of one-hot encodings and recover indices with argmax. Probably not a great idea if there are many countries.
df.values[:] = (df.values[...,None] == np.unique(df)).argmax(-1)
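Writing through df.values[:] only works when .values hands back a writable view of the frame's data, which I would not rely on (as far as I know it is not guaranteed for mixed dtypes, and pandas' Copy-on-Write mode returns a read-only array). A sketch of the searchsorted variant that builds a new frame instead, assuming the sample data from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

values = df.to_numpy()
labels = np.unique(values)               # sorted distinct labels
codes = np.searchsorted(labels, values)  # position of each value among the labels

# Build a fresh frame rather than writing into df's underlying array.
encoded = pd.DataFrame(codes, index=df.index, columns=df.columns)
print(encoded)
```

np.unique already returns its result sorted, so no extra sort is needed before searchsorted.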
Upvotes: 3
Reputation: 323226
You can use replace:
df.replace(dict(zip(np.unique(df.values),list(range(len(np.unique(df.values)))))))
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
Succinct and nice answer from Pir
df.replace((lambda u: dict(zip(u, range(u.size))))(np.unique(df)))
And
df.replace(dict(zip(np.unique(df), itertools.count())))
Upvotes: 5
Reputation: 18208
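Using LabelEncoder from sklearn, you can also try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# fit once on the values of both columns so they share one set of labels
le.fit(df.values.flatten())
# transform only; fit_transform would refit the encoder on each column separately
df = df.apply(le.transform)
print(df)
Result:
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0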
If you have more columns and only want to encode selected columns of the dataframe, you can try:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
# columns to select for encoding
selected_col = ['Origin','Destination']
le.fit(df[selected_col].values.flatten())
# transform only, so every selected column uses the labels fitted above
df[selected_col] = df[selected_col].apply(le.transform)
print(df)
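The point to watch with LabelEncoder is to fit once and only transform afterwards: calling fit_transform per column refits the encoder each time, so a country can end up with different codes in different columns. A minimal sketch on the sample data (variable names are illustrative) that also round-trips back to the labels:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

le = LabelEncoder()
le.fit(df.to_numpy().ravel())      # learn one label set from both columns

encoded = df.apply(le.transform)   # reuse the fitted labels for every column
decoded = encoded.apply(le.inverse_transform)

print(list(le.classes_))  # ['China', 'Russia', 'Turkey', 'USA']
```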
Upvotes: 0
Reputation: 51335
pandas method
You could create a dictionary of {country: value} pairs and map the dataframe to that:
country_map = {country:i for i, country in enumerate(df.stack().unique())}
df['Origin'] = df['Origin'].map(country_map)
df['Destination'] = df['Destination'].map(country_map)
>>> df
Origin Destination
0 0 1
1 0 2
2 1 0
3 1 2
4 1 3
5 3 0
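One thing to keep in mind with the dictionary approach: Series.map returns NaN for any value that is not a key in the mapping, so countries that were not present when the dictionary was built will not get a code. A short sketch, assuming the sample data; "Brazil" is a made-up unseen value:

```python
import pandas as pd

df = pd.DataFrame({
    "Origin": ["China", "China", "USA", "USA", "USA", "Russia"],
    "Destination": ["USA", "Turkey", "China", "Turkey", "Russia", "China"],
})

country_map = {country: i for i, country in enumerate(df.stack().unique())}

# Map every column through the same dictionary in one go.
encoded = df.apply(lambda col: col.map(country_map))

# "Brazil" is not in country_map, so it maps to NaN (and the result becomes float).
new_rows = pd.Series(["China", "Brazil"])
mapped = new_rows.map(country_map)
print(mapped.tolist())  # [0.0, nan]
```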
sklearn method
Since you tagged sklearn, you could use LabelEncoder():
from sklearn.preprocessing import LabelEncoder
le= LabelEncoder()
le.fit(df.stack().unique())
df['Origin'] = le.transform(df['Origin'])
df['Destination'] = le.transform(df['Destination'])
>>> df
Origin Destination
0 0 3
1 0 2
2 3 0
3 3 2
4 3 1
5 1 0
To get the original labels back:
>>> le.inverse_transform(df['Origin'])
# array(['China', 'China', 'USA', 'USA', 'USA', 'Russia'], dtype=object)
Upvotes: 7