Reputation: 7681
I have a pandas dataframe:
   x_axis  y_axis  data
0  Cheese   farms     4
1   wales  Cheese     3
That can be generated with the following code:
import pandas
cols=['x_axis','y_axis','data']
row1=['Cheese','farms',4]
row2=['wales','Cheese',3]
data=pandas.DataFrame([row1,row2],columns=cols)
print(data)
In reality the data I have is much bigger, and the x and y axis values are labels for a heat map. Because these labels are often quite long, I want to enumerate them and replace them with an index that is shared across both the x and y axes (i.e. if cheese is 1 in x_axis, it is also 1 in y_axis). I also need to be able to write a legend that maps the new indexes to their original values.
The desired output might look something like this:
   x_axis  y_axis  data
0       1       2     4
1       3       1     3
Then the legend would be:
cheese=1
farms=2
wales=3
Can anybody give me some suggestions on how to do this programmatically?
Upvotes: 2
Views: 627
Reputation: 176810
You need categorical variables.
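Because you want to convert values in multiple columns using one shared set of codes, you need to stack() the columns into a single Series and then call astype('category') on it (here df is the DataFrame from the question, which the question calls data):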
>>> s = df.loc[:, ['x_axis', 'y_axis']].stack().astype('category')
>>> s
0  x_axis    Cheese
   y_axis     farms
1  x_axis     wales
   y_axis    Cheese
dtype: category
Categories (3, object): [Cheese, farms, wales]
s is now a Series with a categorical dtype: each unique string is mapped to an integer code. If you use the .cat accessor, you can get the integer code of each value. Using unstack() will give you back a DataFrame:
>>> s.cat.codes.unstack()
   x_axis  y_axis
0       0       1
1       2       0
This means that you can assign these integer columns back to the original columns with the following:
>>> df.loc[:, ['x_axis', 'y_axis']] = s.cat.codes.unstack()
>>> df
   x_axis  y_axis  data
0       0       1     4
1       2       0     3
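For anyone who wants to run the steps above in one go, here is a minimal sketch as a standalone script. It rebuilds the question's DataFrame under the name df; the variable names label_cols and codes are my own and not part of pandas.

```python
import pandas as pd

# Rebuild the example DataFrame from the question (named df here).
df = pd.DataFrame(
    [["Cheese", "farms", 4], ["wales", "Cheese", 3]],
    columns=["x_axis", "y_axis", "data"],
)

# Hypothetical helper name: the columns that hold the labels to encode.
label_cols = ["x_axis", "y_axis"]

# Stack both label columns into one Series so they share a single
# set of categories, then convert to a categorical dtype.
s = df[label_cols].stack().astype("category")

# Integer codes, reshaped back to one column per original label column.
codes = s.cat.codes.unstack()

# Overwrite the label columns with their integer codes.
df[label_cols] = codes

print(df)
```

Because both columns are stacked before the conversion, a label that appears in x_axis and in y_axis gets the same code in both.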
The mapping of strings to integers is given by s.cat.categories in the form of an Index, where each label's position is its code (so 'Cheese' = 0, 'farms' = 1, 'wales' = 2):
>>> s.cat.categories
Index(['Cheese', 'farms', 'wales'], dtype='object')
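If you want the codes to start at 1, as in the question's desired output, and want the legend printed in the label=index form, something like the following sketch should work. The names label_cols and legend are mine. Note that the comparison is case-sensitive, so 'Cheese' and 'cheese' would be treated as different labels unless you normalise them first.

```python
import pandas as pd

# Rebuild the example DataFrame from the question (named df here).
df = pd.DataFrame(
    [["Cheese", "farms", 4], ["wales", "Cheese", 3]],
    columns=["x_axis", "y_axis", "data"],
)
label_cols = ["x_axis", "y_axis"]

# One shared categorical across both label columns.
s = df[label_cols].stack().astype("category")

# cat.codes is 0-based; add 1 to get 1-based indexes.
df[label_cols] = s.cat.codes.unstack() + 1

# Build the legend from the categories, numbered from 1 to match.
legend = {label: code for code, label in enumerate(s.cat.categories, start=1)}

print(df)
for label, code in legend.items():
    print(f"{label}={code}")
```

The categories are sorted, so the numbering follows sort order rather than order of first appearance; for this example the two happen to coincide.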
Upvotes: 1