mar

Reputation: 333

How to categorize two categories in one dataframe in Pandas

I have a pandas DataFrame with two categorical columns and 150 categories. A value in column A may not appear in column B. For example:

a = pd.DataFrame({'A':list('bbaba'),  'B':list('cccaa')})
a['A'] = a['A'].astype('category')
a['B'] = a['B'].astype('category')

The output is

Out[217]: 
   A  B
0  b  c
1  b  c
2  a  c
3  b  a
4  a  a

Then I convert the categorical columns to their codes:

cat_columns = a.select_dtypes(['category']).columns
a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)
a

The output is

Out[220]: 
   A  B
0  1  1
1  1  1
2  0  1
3  1  0
4  0  0

My problem is that b is encoded as 1 in column A, while c is encoded as 1 in column B. I want each label to get the same code in both columns, like this:

Out[220]: 
   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0

where 2 stands for c.

Please note that I have 150 different labels.

Upvotes: 2

Views: 1605

Answers (3)

jpp

Reputation: 164783

If you are only interested in converting to categorical codes and being able to access the mapping via a dictionary, pd.factorize may be more convenient.

The approach for getting unique values across columns is from @AlexRiley.

a = pd.DataFrame({'A':list('bbaba'),  'B':list('cccaa')})

fact = dict(zip(*pd.factorize(pd.unique(a[['A', 'B']].values.ravel('K')))[::-1]))

b = a.applymap(fact.get)  # in pandas >= 2.1, applymap is deprecated in favor of a.map(fact.get)

Result:

   A  B
0  0  2
1  0  2
2  1  2
3  0  1
4  1  1
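
The codes above follow order of first appearance (b, a, c). If they should instead follow sorted label order, as in the desired output in the question, one possible variation is to pass sort=True and build the dictionary from the sorted uniques. This is only a sketch: the names labels, label_to_code and b are illustrative, and it assumes the columns hold plain strings without missing values.

```python
import pandas as pd

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# Collect every label that occurs in either column.
labels = pd.unique(a[['A', 'B']].to_numpy().ravel())

# With sort=True the returned uniques are in sorted order (a, b, c).
_, uniques = pd.factorize(labels, sort=True)

# Number the sorted uniques directly; zipping uniques with the returned
# codes would be wrong here, because the codes follow the input order.
label_to_code = {label: code for code, label in enumerate(uniques)}

# Apply the same mapping to every column.
b = a.apply(lambda col: col.map(label_to_code))
print(b)
```

The dictionary label_to_code can be kept around to translate new data with the same codes.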

Upvotes: 0

piRSquared

Reputation: 294506

We can apply pd.factorize to all the values at once.

pd.DataFrame(
    pd.factorize(a.values.ravel())[0].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  0  1
1  0  1
2  2  1
3  0  2
4  2  2

Or, to factorize by sorted category value, pass sort=True:

pd.DataFrame(
    pd.factorize(a.values.ravel(), sort=True)[0].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0

Or equivalently with np.unique (assuming numpy is imported as np):

pd.DataFrame(
    np.unique(a.values.ravel(), return_inverse=True)[1].reshape(a.shape),
    a.index, a.columns
)

   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
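
Both pd.factorize and np.unique also return the unique labels, which makes it possible to go back from codes to labels. A minimal sketch of that round trip with np.unique, assuming string labels without missing values; the names encoded and decoded are illustrative:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# uniques is sorted; inverse holds the position of each value in uniques.
uniques, inverse = np.unique(a.to_numpy().ravel(), return_inverse=True)
encoded = pd.DataFrame(inverse.reshape(a.shape), index=a.index, columns=a.columns)

# Indexing the sorted uniques with the codes restores the original labels.
decoded = pd.DataFrame(uniques[encoded.to_numpy()], index=a.index, columns=a.columns)
print(encoded)
print(decoded)
```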

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210942

Using pd.Categorical() you can specify a list of categories:

In [44]: cats = a[['A','B']].stack().sort_values().unique()

In [45]: cats
Out[45]: array(['a', 'b', 'c'], dtype=object)

In [46]: a['A'] = pd.Categorical(a['A'], categories=cats)

In [47]: a['B'] = pd.Categorical(a['B'], categories=cats)

In [48]: a[cat_columns] = a[cat_columns].apply(lambda x: x.cat.codes)

In [49]: a
Out[49]:
   A  B
0  1  2
1  1  2
2  0  2
3  1  0
4  0  0
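
With many columns, the same idea can be written without repeating the pd.Categorical call per column, by building one CategoricalDtype and casting the whole frame to it. This is a sketch under the assumption that every column should share the sorted union of labels; the names cats, shared and coded are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

a = pd.DataFrame({'A': list('bbaba'), 'B': list('cccaa')})

# One dtype holding the sorted union of labels from both columns.
cats = sorted(pd.unique(a[['A', 'B']].to_numpy().ravel()))
shared = CategoricalDtype(categories=cats)

# Casting every column to the same dtype makes the codes comparable.
coded = a.astype(shared).apply(lambda col: col.cat.codes)
print(coded)
```

Because the categories are fixed up front, a label that is missing from one column still keeps its code there.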

Upvotes: 4
