Replacing string elements of a pandas DataFrame with integers

Question

I have a pandas dataframe:

   x_axis  y_axis  data
0  Cheese   farms     4
1   wales  Cheese     3

It can be generated with the following code:

import pandas
cols=['x_axis','y_axis','data']
row1=['Cheese','farms',4]
row2=['wales','Cheese',3]
data=pandas.DataFrame([row1,row2],columns=cols)
print(data)

In reality my data is much bigger, and the x_axis and y_axis values are the labels for a heat map. Because these labels are often quite long, I want to enumerate them and replace each one with an index that is shared across both axes (so if Cheese is 1 on the x axis, it is also 1 on the y axis). I also need to be able to write a legend that maps the new indexes back to their original values.

The desired output might look something like this:

  x_axis y_axis  data
0      1      2     4
1      3      1     3

Then the legend would be:

Cheese=1
farms=2
wales=3

Can anybody give me some suggestions on how to do this programmatically?

Alex Riley · Accepted Answer

You need categorical variables.

Because you want the same mapping across multiple columns, stack() them into a single Series first and then call astype('category') (df here is the DataFrame from the question, where it is named data):

>>> s = df.loc[:, ['x_axis', 'y_axis']].stack().astype('category')
>>> s
0  x_axis    Cheese
   y_axis     farms
1  x_axis     wales
   y_axis    Cheese
dtype: category
Categories (3, object): [Cheese, farms, wales]

s is now a Series with a categorical dtype: each unique string is mapped to an integer code.
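As a small illustration of that string-to-code mapping, here is a sketch on a standalone Series (the name labels and its values are made up for this example, not taken from the question's data). It also shows that a missing value gets the code -1, which is worth knowing if the real label columns contain gaps:

```python
import pandas as pd

# Hypothetical labels, including one missing value.
labels = pd.Series(["Cheese", "farms", "wales", "Cheese", None], dtype="category")

# The unique values, in sorted order; a value's position here is its code.
categories = labels.cat.categories.tolist()

# One integer per row; missing values are encoded as -1.
codes = labels.cat.codes.tolist()

print(categories)
print(codes)
```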

If you use the .cat accessor, you can get the integer code of each value. Calling unstack() on the codes gives you back a DataFrame with the original columns:

>>> s.cat.codes.unstack()
   x_axis  y_axis
0       0       1
1       2       0

This means that you can assign these integer columns back to the original columns with the following:

>>> df.loc[:, ['x_axis', 'y_axis']] = s.cat.codes.unstack()
>>> df
   x_axis  y_axis  data
0       0       1     4
1       2       0     3

The mapping of strings to integers is given by s.cat.categories in the form of an Index (so 'Cheese' = 0, 'farms' = 1, 'wales' = 2):

>>> s.cat.categories
Index(['Cheese', 'farms', 'wales'], dtype='object')
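The question asked for indexes starting at 1 and for a legend, whereas cat.codes starts at 0. One way to put the pieces together is sketched below; the names label_cols, encoded and legend are my own, and building a new DataFrame with assign (instead of writing back into df) is just one possible choice that keeps the original labels around:

```python
import pandas as pd

df = pd.DataFrame(
    [["Cheese", "farms", 4], ["wales", "Cheese", 3]],
    columns=["x_axis", "y_axis", "data"],
)

label_cols = ["x_axis", "y_axis"]

# Stack both label columns so they share a single set of categories.
s = df[label_cols].stack().astype("category")

# cat.codes is 0-based; add 1 to match the numbering asked for in the question.
codes = s.cat.codes.unstack() + 1

# Legend: original label -> new integer index (same 1-based numbering).
legend = {label: i for i, label in enumerate(s.cat.categories, start=1)}

# New frame with the label columns replaced by their integer indexes.
encoded = df.assign(**{col: codes[col] for col in label_cols})

print(encoded)
for label, i in legend.items():
    print(f"{label}={i}")
```

Because the categories are sorted, the numbering is case-sensitive ('Cheese' sorts before 'farms' and 'wales' here); normalise the case of the labels first if that matters for your data.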
