jxn

Reputation: 8025

pandas get mapping of categories to integer value

I can transform categorical columns to their categorical codes, but how do I get an accurate picture of their mapping? Example:

import pandas as pd

df_labels = pd.DataFrame({'col1':[1,2,3,4,5], 'col2':list('abcab')})
df_labels['col2'] = df_labels['col2'].astype('category')

df_labels looks like this:

   col1 col2
0     1    a
1     2    b
2     3    c
3     4    a
4     5    b

How do I get an accurate mapping of the cat codes to the cat categories? The Stack Overflow answer below says to enumerate the categories. However, I'm not sure whether enumerating is how cat.codes generated the integer values. Is there a more accurate way?

Get mapping of categorical variables in pandas

>>> dict( enumerate(df.five.cat.categories) )

{0: 'bad', 1: 'good'}

What is a good way to get the mapping in the above format but accurate?

Upvotes: 7

Views: 14202

Answers (4)

JohnE

Reputation: 30414

OP asks for something "accurate" relative to the answer in the linked question:

dict(enumerate(df_labels.col2.cat.categories))

# {0: 'a', 1: 'b', 2: 'c'}

I believe that the above answer is indeed accurate (full disclosure: it is my answer in the other question that I'm defending). Note also that it is roughly equivalent to @pomber's answer, except that the ordering of the keys and values is reversed. (Since both keys and values are unique, the ordering is in some sense irrelevant, and easy enough to reverse as a consequence).
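One way to check this, as a minimal sketch rather than a proof: each code should be the position of the row's value in cat.categories, so looking the codes up in the enumerated mapping should rebuild the column. The shuffled input order below is my own assumption for illustration; it is not from the question.

```python
import pandas as pd

# Values deliberately not in sorted order (an illustrative assumption).
s = pd.Series(list("cbacb")).astype("category")

# code -> category, built only from the categories index
mapping = dict(enumerate(s.cat.categories))

# If each code is the position of the value in s.cat.categories,
# looking the codes up in the mapping rebuilds the original values.
rebuilt = [mapping[code] for code in s.cat.codes]
matches = rebuilt == s.tolist()
print(mapping, matches)
```

If matches comes out True on your own data as well, the enumerate mapping agrees with what cat.codes produced.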

However, the following way is arguably safer, or at least more transparent as to how it works:

dict(zip(df_labels.col2.cat.codes, df_labels.col2))

# {0: 'a', 1: 'b', 2: 'c'}

This is similar in spirit to @boud's answer, but corrects an error by replacing df_labels.col2.cat.categories with df_labels.col2. It also replaces list() with dict(), which seems more appropriate for a mapping and automatically gets rid of duplicates.

Note that the length of both arguments to zip() is len(df), whereas the length of df_labels.col2.cat.categories is a count of unique values which will generally be much shorter than len(df).

Also note that this method is quite inefficient as it maps 0 to 'a' twice, and similarly for 'b'. In large dataframes the difference in speed could be pretty big. But it won't cause any error because dict() will remove redundancies like this -- it's just that it will be much less efficient than the other method.
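Beyond speed, the two approaches can also return different keys. The sketch below assumes a column with one missing value and one unused category, both of which I added for illustration (the question's data has neither): the row-wise zip only sees codes that actually occur, including -1 for a missing value, while enumerating the categories lists every declared category.

```python
import pandas as pd

# "c" is declared but never used, and one value is missing (illustrative assumptions).
s = pd.Series(pd.Categorical(["b", "a", None, "b"], categories=["a", "b", "c"]))

# Built from the categories index: one entry per declared category.
from_categories = dict(enumerate(s.cat.categories))

# Built row by row: one entry per code that actually appears in the data.
from_rows = dict(zip(s.cat.codes, s))

unused_in_rows = 2 in from_rows        # is the unused category "c" present?
missing_in_rows = -1 in from_rows      # is the missing-value code present?
print(from_categories, unused_in_rows, missing_in_rows)
```

Which behaviour counts as more "accurate" depends on whether you want the full set of categories or only what appears in the column.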

Upvotes: 5

Zeugma

Reputation: 32085

Edited answer (removed cat.categories and changed list to dict):

>>> dict(zip(df_labels.col2.cat.codes, df_labels.col2))

{0: 'a', 1: 'b', 2: 'c'}

The original answer which some of the comments are referring to:

>>> list(zip(df_labels.col2.cat.codes, df_labels.col2.cat.categories))

[(0, 'a'), (1, 'b'), (2, 'c')]

As the comments note, the original answer works in this example only because the first three values happened to be [a,b,c]; it would fail if they were instead [c,b,a] or [b,c,a].
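A small sketch of that failure, assuming the same kind of column but with the rows starting [c,b,a] (the data here is made up for illustration):

```python
import pandas as pd

s = pd.Series(list("cbacb")).astype("category")

# Original answer: pairs row-wise codes with the categories index.
# zip() stops at the shorter input, so only the first three rows are used,
# and the code 2 (from the first row, "c") ends up paired with "a".
original = list(zip(s.cat.codes, s.cat.categories))

# Edited answer: pairs each row's code with that same row's value.
edited = dict(zip(s.cat.codes, s))

original_as_ints = [(int(code), cat) for code, cat in original]
print(original_as_ints)
print(edited)
```

The edited version stays correct because the code and the value always come from the same row.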

Upvotes: 6

pomber

Reputation: 23980

I use:

dict([(category, code) for code, category in enumerate(df_labels.col2.cat.categories)])

# {'a': 0, 'b': 1, 'c': 2}
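The same category-to-code mapping can also be written as a dict comprehension. The sketch below uses a standalone Series with the question's values (the names code_of and looked_up are mine) and checks the mapping against cat.codes:

```python
import pandas as pd

s = pd.Series(list("abcab")).astype("category")

# category -> code, built from the position of each category
code_of = {category: code for code, category in enumerate(s.cat.categories)}

# Looking each value up should reproduce s.cat.codes.
looked_up = [code_of[value] for value in s]
same_as_codes = looked_up == s.cat.codes.tolist()
print(code_of, same_as_codes)
```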

Upvotes: 6

Neo X

Reputation: 957

If you want to convert each column or Series from categorical back to the original values, you just need to reverse what you did in the for loop over the dataframe. There are two methods to do that:

  1. To get back to the original Series or numpy array, use Series.astype(original_dtype) or np.asarray(categorical).

  2. If you already have codes and categories, you can use the from_codes() constructor to skip the factorize step that the normal constructor performs.

See pandas: Categorical Data
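As a rough sketch of the first method, assuming the original values were plain strings (so object is used here as the target dtype; substitute whatever dtype your column had before the conversion):

```python
import numpy as np
import pandas as pd

s = pd.Series(list("abcab")).astype("category")

# Series.astype(...) returns a non-categorical Series with the same values.
as_object = s.astype(object)

# np.asarray(...) returns a NumPy array of the underlying values.
as_array = np.asarray(s)

print(as_object.tolist())
print(as_array.tolist())
```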


Usage of from_codes

According to the official documentation, it builds a Categorical from arrays of codes and categories.

import numpy as np
import pandas as pd

splitter = np.random.choice([0,1], 5, p=[0.5,0.5])
s = pd.Series(pd.Categorical.from_codes(splitter, categories=["train", "test"]))
print(splitter)
print(s)

gives

[0 1 1 0 0]
0    train
1     test
2     test
3    train
4    train
dtype: category
Categories (2, object): [train, test]

For your code:

# after your previous conversion, df['col2'] holds the integer codes
print(df['col2'])
# apply from_codes; the 2nd argument is the list of categories from the mapping dict
s = pd.Series(pd.Categorical.from_codes(df['col2'], list('abcde')))
print(s)

gives

0    0
1    1
2    2
3    0
4    1
Name: col2, dtype: int8
0    a
1    b
2    c
3    a
4    b
dtype: category
Categories (5, object): [a, b, c, d, e]

Upvotes: 4
