Why do I receive two different behaviours when converting a column to a category in pandas?
As an example, let's say I create a dataframe with:
>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
u'0.22.0'
>>> np.__version__
'1.14.0'
>>> df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])
Now I convert my column to an object:
>>> df['nombre'] = df['nombre'].astype('object')
>>> print(df['nombre'].dtype)
object
The dtype is now object.
>>> df['nombre'] = df['nombre'].astype('category')
>>> print(df['nombre'].cat.categories.dtype.name)
int64
After converting to a category, the internal dtype is int64.
Let's start again with a new dataframe
>>> del df
>>> df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])
This time, we convert the column to 'str':
>>> df['nombre'] = df['nombre'].astype('str')
>>> print(df['nombre'].dtype)
object
The internal representation is an object. It makes sense since we converted to a 'str'.
>>> df['nombre'] = df['nombre'].astype('category')
>>> print(df['nombre'].cat.categories.dtype.name)
object
After converting to a category, the internal dtype is now object, which is different from the int64 that we received before.
So my question is the following: why do I receive two different behaviours when converting from an object dtype to a category?
Upvotes: 2
Views: 1159
.astype(object) doesn't convert numbers to strings. It converts numbers to the corresponding Python objects (in your example, numpy.int64 to a Python int).
For example,
df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])
type(df['nombre'][0])
Out[64]: numpy.int64
df['nombre'] = df['nombre'].astype('object')
type(df['nombre'][0])
Out[66]: int
But when you use astype(str), it converts everything to strings. While doing that, it also converts the Series to an object Series, because that's the only dtype that can hold strings.
df['nombre'] = df['nombre'].astype('str')
type(df['nombre'][0])
Out[69]: str
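So the difference comes from the values you pass in, not from the conversion to category itself. In the first case the values are still ints, so the categories form an integer array. In the second case the values are strings, so the categories form an object array.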
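To see both paths side by side, here is a minimal sketch that rebuilds the question's data as a standalone Series (the variable names are only illustrative). It assumes a reasonably recent pandas. The dtype names printed may differ between versions (the question's 0.22.0 reports int64 and object), but the element types should make the difference clear.

```python
import pandas as pd

# Same integer data as in the question, converted two ways before categorising.
base = pd.Series([1, 2, 3, 4], name="nombre")

as_object = base.astype("object")  # values remain numbers, boxed as Python objects
as_str = base.astype("str")        # values become the strings '1', '2', ...

# The categories are built from whatever values the Series holds at this point.
cats_from_object = as_object.astype("category").cat.categories
cats_from_str = as_str.astype("category").cat.categories

print(type(as_object.iloc[0]).__name__, cats_from_object.dtype)
print(type(as_str.iloc[0]).__name__, cats_from_str.dtype)
```

Both intermediate Series can report an object dtype, yet the categories built from them hold numbers in one case and strings in the other.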
Also, the term "the internal dtype" may not be appropriate here. What you are printing is the dtype of the Index that holds the categories, not the dtype of their codes. In both examples, df['nombre'].cat.codes is the internal representation, and its dtype is int8.
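As a quick check of that last point, the following sketch builds one categorical Series from ints and one from strings and prints the dtype of the categories next to the dtype of the codes. The Series are constructed directly here for brevity, which is an assumption rather than the exact steps from the question.

```python
import pandas as pd

int_cat = pd.Series([1, 2, 3, 4]).astype("category")
str_cat = pd.Series(["1", "2", "3", "4"]).astype("category")

for label, s in [("int", int_cat), ("str", str_cat)]:
    # categories: the distinct values; codes: small integers pointing into them
    print(label, s.cat.categories.dtype, s.cat.codes.dtype, s.cat.codes.tolist())
```

With only four categories, the codes fit into int8 in both cases, regardless of what the categories themselves contain.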
Upvotes: 4