DaveIdito

Reputation: 1606

Preserving dtype as category after integer encoding in Pandas DataFrame column

I have a Pandas DataFrame read from a CSV file in which some columns hold string values and therefore have the object dtype. Because they are categorical, I convert them to category and then to their integer representation before fitting a random forest regressor.

for col in df_raw.select_dtypes(include='object'):
    df_raw[col] = df_raw[col].astype('category')
    df_raw[col] = df_raw[col].cat.codes #not 'category' type anymore.

The problem is if I do this, then the dtype is immediately converted to int and I lose the cat information, which I need later.

For example, after the first line in the loop, I can run df_raw[col].cat and get the indexed categories as expected. But once the second line is executed, the column dtype changes to int8, and I get the error:

Can only use .cat accessor with a 'category' dtype

which, in a way, makes perfect sense, since its dtype is int8.

Is it possible to preserve the category encoding information in the same DataFrame and at the same time have integer encodings in place to fit the regressor? How?

Upvotes: 4

Views: 4462

Answers (2)

Amith

Reputation: 1

You can use

train.col = pd.Categorical(train.col)

to convert the column from int back to a categorical type.

And then run

train.col.cat.codes  
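One caveat worth illustrating: calling pd.Categorical on a column that already holds integer codes produces categories that are the integers themselves, not the original labels. A minimal sketch of one way to get the labels back, assuming the categories were saved before encoding, uses pd.Categorical.from_codes. The sample data, the column name "color", and the variable saved_categories are hypothetical and not from the question.

```python
import pandas as pd

# Hypothetical sample data; the column name "color" is an assumption.
train = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Convert to category and remember the categories before encoding.
train["color"] = train["color"].astype("category")
saved_categories = train["color"].cat.categories

# Replace the column with its integer codes (the dtype becomes int8).
train["color"] = train["color"].cat.codes

# Rebuild the categorical column from the codes and the saved categories.
train["color"] = pd.Categorical.from_codes(
    train["color"].to_numpy(), categories=saved_categories
)
```

After the last step, train["color"].cat works again and the column holds the original labels.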

Upvotes: 0

sophros

Reputation: 16660

1. Simple idea

Why not use a derived column when fitting the regressor, e.g.:

df_raw[col + '_calculated'] = df_raw[col].cat.codes

This way you have both: a categorical column col that keeps the original feature unchanged, and a "calculated" column with the ints needed for further processing.
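The derived-column idea can be sketched end to end as follows. The sample data and the target column "price" are hypothetical; the "_calculated" suffix follows the line above. The regressor itself is left out so the sketch depends only on pandas.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV from the question.
df_raw = pd.DataFrame({
    "city": pd.Series(["Oslo", "Lima", "Oslo", "Kyiv"], dtype="object"),
    "size": pd.Series(["small", "large", "small", "medium"], dtype="object"),
    "price": [10.0, 20.0, 12.0, 15.0],
})

# Keep each categorical column and add its integer codes next to it.
for col in df_raw.select_dtypes(include="object").columns:
    df_raw[col] = df_raw[col].astype("category")
    df_raw[col + "_calculated"] = df_raw[col].cat.codes

# Only the derived integer columns are passed on as features.
feature_cols = [c for c in df_raw.columns if c.endswith("_calculated")]
X = df_raw[feature_cols]
y = df_raw["price"]
```

Here X and y could be handed to the regressor's fit method, while df_raw["city"].cat.categories remains available afterwards.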

2. More clever approach

Another approach is to wrap the dataframe before passing it to the fit method, so that the regressor accesses .cat.codes instead of the categorical values directly:

def access_wrapper(dframe, col):
    yield from dframe[col].cat.codes

fit(..., access_wrapper(df, col))

This way you do not modify the dataframe at all and do not copy the values from df[col], at the cost of going through dframe[col].cat.codes on each access to a value (which should be fairly quick).
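Some fit methods may expect a 2-D array-like rather than a generator, so a variation on the wrapper idea is to build the encoded frame only at fit time. This is a sketch under that assumption, and unlike the generator above it does copy the data. The helper name with_category_codes and the sample data are hypothetical.

```python
import pandas as pd

def with_category_codes(dframe):
    """Return a copy in which every category column is replaced by its integer codes."""
    encoded = dframe.copy()
    for col in encoded.select_dtypes(include="category").columns:
        encoded[col] = encoded[col].cat.codes
    return encoded

# Hypothetical sample data with one categorical feature.
df = pd.DataFrame({
    "grade": pd.Categorical(["b", "a", "c", "a"]),
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Encode only for fitting; df itself keeps its category dtype.
X = with_category_codes(df[["grade"]])
y = df["score"]
```

The original df is left untouched, so df["grade"].cat is still usable after fitting.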

Upvotes: 1
