Reputation: 333
I have one pd including two categorical columns with 150 categories. May be a value in column A
is not appeared in Column B
. For example
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')
The output is
Out[217]:
A B
0 b c
1 b c
2 a c
3 b a
4 a a
And also
cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a
The output is
Out[220]:
A B
0 1 1
1 1 1
2 0 1
3 1 0
4 0 0
My problem is that in column A
, the b
is considered as 1
, but in column B
, the c
is considered as 1
. However, I want something like this:
Out[220]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
which 2
is considered as c
.
Please note that I have 150 different labels.
Upvotes: 2
Views: 1605
Reputation: 164783
If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize
may be more convenient.
Algorithm for getting unique values across columns via @AlexRiley.
a = pd.DataFrame({'A':list('bbaba'), 'B':list('cccaa')})
fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))
b = a.applymap(fact.get)
Result:
A B
0 0 2
1 0 2
2 1 2
3 0 1
4 1 1
Upvotes: 0
Reputation: 294506
We can use pd.factorize
all at once.
pd.DataFrame(
pd.factorize(a.values.ravel())[0].reshape(a.shape),
a.index, a.columns
)
A B
0 0 1
1 0 1
2 2 1
3 0 2
4 2 2
Or if you wanted to factorize by sorted category value, use the sort=True
argument
pd.DataFrame(
pd.factorize(a.values.ravel(), True)[0].reshape(a.shape),
a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Or equivalently with np.unique
pd.DataFrame(
np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
a.index, a.columns
)
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Upvotes: 1
Reputation: 210942
Using pd.Categorical()
you can specify a list of categories:
In [44]: cats = a[['A','B']].stack().sort_values().unique()
In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)
In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)
In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)
In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
In [49]: a
Out[49]:
A B
0 1 2
1 1 2
2 0 2
3 1 0
4 0 0
Upvotes: 4