Reputation: 1606
I have a Pandas DataFrame read from a CSV file in which some columns hold string values and therefore have the object dtype. Because they are categorical, I convert them to the category dtype and then to their integer representation before fitting a random forest regressor.
for col in df_raw.select_dtypes(include='object'):
    df_raw[col] = df_raw[col].astype('category')
    df_raw[col] = df_raw[col].cat.codes  # not 'category' type anymore
The problem is that once I do this, the dtype is immediately converted to int and I lose the cat information, which I need later.
For example, after the first line in the loop, I can run df_raw[col].cat and get the indexed categories as expected. But once the second line is executed, the column dtype changes to int8 and I get the error:
Can only use .cat accessor with a 'category' dtype
which, in a way, makes perfect sense, since its dtype is int8.
Is it possible to preserve the category encoding information in the same DataFrame and at the same time have integer encodings in place to fit the regressor? How?
Upvotes: 4
Views: 4462
Reputation: 1
You can use
train.col = pd.Categorical(train.col)
to convert the column back to a categorical type from int, and then run
train.col.cat.codes
Upvotes: 0
Reputation: 16660
1. Simple idea
Why not use a derived column when fitting the regressor, e.g.:
df_raw[col + '_calculated'] = df_raw[col].cat.codes
This way you have both: a categorical column col that leaves the feature unchanged, and a "calculated" column with ints as needed for further processing.
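Applied to every categorical column, the idea could look like the sketch below. The toy data and the explicit `cat_cols` list are assumptions for illustration; the question selects the columns with select_dtypes instead.

```python
import pandas as pd

# Hypothetical toy data standing in for df_raw.
df_raw = pd.DataFrame({
    "color": ["red", "green", "red", "blue"],
    "size": ["S", "M", "L", "M"],
    "price": [1.0, 2.5, 3.0, 4.5],
})

# Columns to treat as categorical (listed explicitly in this sketch).
cat_cols = ["color", "size"]
for col in cat_cols:
    # Keep the original column as category dtype...
    df_raw[col] = df_raw[col].astype("category")
    # ...and store the integer codes in a separate derived column.
    df_raw[col + "_calculated"] = df_raw[col].cat.codes

# Feature matrix for the regressor: drop the categorical originals.
X = df_raw.drop(columns=cat_cols)
```

`X` then holds only numeric columns, while `df_raw[col].cat` remains available for looking up the labels.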
2. More clever approach
Another approach could be to wrap the dataframe before passing it to the fit method in such a way that the regressor accesses .cat.codes instead of the categorical value directly:
def access_wrapper(dframe, col):
    yield from dframe[col].cat.codes

fit(..., access_wrapper(df, col))
In this way you do not affect the dataframe at all and do not copy the values from df[col], at the expense of calling dframe[col].cat.codes on each access to the value (which should be fairly quick).
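A variation on this wrapper idea, assuming the estimator expects array-like input (as scikit-learn estimators do) rather than a generator: build the encoded frame on the fly at fit time. Unlike the generator above, this sketch does create a temporary copy; the helper name `encoded_view` and the toy data are hypothetical.

```python
import pandas as pd

def encoded_view(dframe):
    """Return a new frame with each category column replaced by its integer codes."""
    cat_cols = dframe.select_dtypes(include="category").columns
    # assign() returns a new DataFrame; the input frame is left untouched.
    return dframe.assign(**{c: dframe[c].cat.codes for c in cat_cols})

# Hypothetical toy data.
df = pd.DataFrame({"color": ["red", "green", "red", "blue"],
                   "price": [1.0, 2.5, 3.0, 4.5]})
df["color"] = df["color"].astype("category")

X = encoded_view(df)  # pass X to the regressor's fit method
```

The original `df` keeps its category dtype, so `df["color"].cat.categories` is still available after fitting.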
Upvotes: 1