Reputation: 141
Consider the following dataframe:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
df = df.apply(LabelEncoder().fit_transform)
print(df)
It currently outputs:
a b c
0 0 1 0
1 1 0 0
My goal is to make it output something like this by passing in the columns that should share categorical values:
a b c
0 0 1 2
1 1 0 2
Upvotes: 5
Views: 5648
Reputation: 402263
You can do this with pd.factorize.
df = df.stack()
df[:] = pd.factorize(df)[0]
df.unstack()
a b c
0 0 1 2
1 1 0 2
If you want to encode only some of the columns in the dataframe:
temp = df[['a', 'b']].stack()
temp[:] = temp.factorize()[0]
df[['a', 'b']] = temp.unstack()
a b c
0 0 1 Belgium
1 1 0 Belgium
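As a rough sketch, the two snippets above can be wrapped in one helper that takes the columns to encode together. The name encode_shared and its return shape are my own invention for illustration, not part of pandas. It also returns the uniques from pd.factorize so the codes can be mapped back to labels.

```python
import pandas as pd

def encode_shared(df, cols):
    """Hypothetical helper: encode `cols` with one shared set of integer codes."""
    out = df.copy()
    # Stack the selected columns into one long Series so that
    # factorize sees every value and assigns codes from a single vocabulary.
    stacked = out[cols].stack()
    codes, uniques = pd.factorize(stacked)
    # Put the codes back on the stacked index and pivot to the original shape.
    encoded = pd.Series(codes, index=stacked.index).unstack()
    out[cols] = encoded[cols]
    return out, uniques

df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)
encoded, uniques = encode_shared(df, ["a", "b"])
print(encoded)
print(list(uniques))
```

uniques[code] gives back the original label for a code, so uniques[0] is "France" here. This sketch assumes the selected columns contain no missing values.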
Upvotes: 2
Reputation: 164613
Here's an alternative solution using categorical data. It is similar to @unutbu's, but the codes follow the order in which values first appear (scanning column by column). In other words, the first value found will have code 0.
import pandas as pd
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
                        ["Italy", "France", "Belgium"]],
                  columns=["a", "b", "c"])
# get unique values in order of first appearance, column by column
vals = df.T.stack().unique()
# convert each column to a categorical with the shared categories, then extract codes
for col in df:
    df[col] = pd.Categorical(df[col], categories=vals)
    df[col] = df[col].cat.codes
print(df)
a b c
0 0 1 2
1 1 0 2
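The same idea can probably be written without the explicit loop by building one CategoricalDtype and casting every column to it. This is only a sketch of a variation: the names shared and codes are mine, and ravel(order="F") is used here in place of df.T.stack() to walk the values column by column.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

# Collect the vocabulary column by column, in order of first appearance.
vals = pd.unique(df.to_numpy(dtype=object).ravel(order="F"))
shared = pd.CategoricalDtype(categories=vals)

# Cast every column to the same categorical dtype, then read off the codes.
codes = df.apply(lambda col: col.astype(shared).cat.codes)
print(codes)
print(list(shared.categories))
```

Keeping the dtype object around means the same categories can be reused on other columns or frames later.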
Upvotes: 0
Reputation: 19947
If the encoding order doesn't matter, you can do:
df_new = pd.DataFrame(
    data=LabelEncoder().fit_transform(df.values.flatten()).reshape(df.shape),
    columns=df.columns,
)
df_new
Out[27]:
a b c
0 1 2 0
1 2 1 0
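For what it's worth, LabelEncoder numbers the labels in sorted order, which is why Belgium gets 0 here. If you would rather not depend on scikit-learn, np.unique with return_inverse=True should give the same codes. The sketch below swaps in NumPy for the encoder (so it is not the code above) and also shows one way to decode again. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"], ["Italy", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

# Flatten row by row; np.unique returns the sorted labels and, for each
# input value, the position of its label in that sorted array.
flat = df.to_numpy(dtype=object).ravel()
labels, inverse = np.unique(flat, return_inverse=True)

df_new = pd.DataFrame(inverse.reshape(df.shape), index=df.index, columns=df.columns)
print(df_new)
print(labels.tolist())

# Decode by indexing the label array with the codes.
decoded = pd.DataFrame(labels[df_new.to_numpy()], index=df.index, columns=df.columns)
print(decoded)
```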
Upvotes: 1
Reputation: 879103
Pass axis=1 to call LabelEncoder().fit_transform once for each row. (By default, df.apply(func) calls func once for each column.)
import pandas as pd
from sklearn.preprocessing import LabelEncoder
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
encoder = LabelEncoder()
df = df.apply(encoder.fit_transform, axis=1)
print(df)
yields
a b c
0 1 2 0
1 2 1 0
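One caveat worth checking before relying on the row-wise call: each row is fitted on its own, so the codes only line up across rows when every row contains the same set of values, as in this example. The sketch below uses made-up data where the second row has "Spain" instead of "Italy", and uses pd.factorize(..., sort=True) as a stand-in for LabelEncoder (both number labels in sorted order). The helper sorted_codes is hypothetical. Separately, on newer pandas versions a row-wise apply that returns arrays may give a Series of arrays rather than a DataFrame unless result_type is set, so that is worth verifying on your version.

```python
import pandas as pd

# Hypothetical data: the second row contains "Spain" instead of "Italy".
df = pd.DataFrame(
    data=[["France", "Italy", "Belgium"], ["Spain", "France", "Belgium"]],
    columns=["a", "b", "c"],
)

def sorted_codes(values):
    # Codes follow the sorted order of the unique values passed in.
    codes, _ = pd.factorize(values, sort=True)
    return codes

# Row by row: every row gets its own vocabulary.
per_row = pd.DataFrame(
    [sorted_codes(row) for row in df.to_numpy(dtype=object)],
    index=df.index,
    columns=df.columns,
)

# Whole frame at once: one vocabulary shared by every cell.
shared = pd.DataFrame(
    sorted_codes(df.to_numpy(dtype=object).ravel()).reshape(df.shape),
    index=df.index,
    columns=df.columns,
)

print(per_row)
print(shared)
```

In per_row both "Italy" (row 0) and "Spain" (row 1) end up with code 2, while in shared they get different codes.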
Alternatively, you could convert the data to category dtype and use the category codes as labels:
import pandas as pd
df = pd.DataFrame(data=[["France", "Italy", "Belgium"],
["Italy", "France", "Belgium"]], columns=["a", "b", "c"])
stacked = df.stack().astype('category')
result = stacked.cat.codes.unstack()
print(result)
also yields
a b c
0 1 2 0
1 2 1 0
This should be significantly faster since it does not require calling encoder.fit_transform once for each row (which might give terrible performance if you have lots of rows).
Upvotes: 3