J. Leroux

Reputation: 39

Why is the dtype different when converting to a category from an object or a str?

Why do I receive two different behaviours when converting a column to a category in pandas?

As an example, let's say I create a dataframe with:

>>> import pandas as pd
>>> import numpy as np
>>> pd.__version__
u'0.22.0'
>>> np.__version__
'1.14.0'
>>> df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])

Now I convert my column to an object:

>>> df['nombre'] = df['nombre'].astype('object')
>>> print(df['nombre'].dtype)
object

The dtype is now object.

>>> df['nombre'] = df['nombre'].astype('category')
>>> print(df['nombre'].cat.categories.dtype.name)
int64

After converting to a category, the internal dtype is int64.

Let's start again with a new dataframe

>>> del df
>>> df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])

This time, we convert the column to 'str':

>>> df['nombre'] = df['nombre'].astype('str')
>>> print(df['nombre'].dtype)
object

The internal representation is an object. It makes sense since we converted to a 'str'.

>>> df['nombre'] = df['nombre'].astype('category')
>>> print(df['nombre'].cat.categories.dtype.name)
object

After converting to a category, the internal dtype is now object, which is different from the int64 that we received before.

So my question is: why do I receive two different behaviours when converting from an object dtype to a category?

Upvotes: 2

Views: 1159

Answers (1)

user2285236

.astype(object) doesn't convert numbers to strings. It converts numbers to corresponding Python objects (in your example, numpy.int64 to a Python int).

For example,

df = pd.DataFrame(columns=['nombre'], data=[1,2,3,4])

type(df['nombre'][0])
Out[64]: numpy.int64


df['nombre'] = df['nombre'].astype('object')

type(df['nombre'][0])
Out[66]: int

But when you use astype(str), it converts every value to a string. As a side effect, the Series becomes an object Series, because in this version of pandas (0.22) object is the only dtype that can hold strings.

df['nombre'] = df['nombre'].astype('str')

type(df['nombre'][0])
Out[69]: str

So the difference comes from your input data, not from the conversion itself. In the first case the values are ints, so the categories are stored as an integer array. In the second case the values are strings, so the categories are stored as an object array.
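To see both paths side by side, here is a minimal sketch. It rebuilds the column as a standalone Series (the variable names are mine, not from the question) and inspects the element types before looking at the dtype of the categories. The exact dtype reported for the string case may depend on your pandas version, so no output is shown for it.

```python
import pandas as pd

# Same data as in the question, as a standalone Series.
s = pd.Series([1, 2, 3, 4], name="nombre")

# Path 1: object dtype. The values become Python ints.
as_object = s.astype("object")
# Path 2: str. The values become Python strings.
as_str = s.astype("str")

print({type(v).__name__ for v in as_object})
print({type(v).__name__ for v in as_str})

# The categories are inferred from the values themselves,
# so the two paths end up with different category dtypes.
cats_from_object = as_object.astype("category").cat.categories
cats_from_str = as_str.astype("category").cat.categories

print(cats_from_object.dtype)
print(cats_from_str.dtype)
```

The first pair of prints shows what the values actually are; the second pair shows that the category dtype follows from those values rather than from the dtype label of the Series.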

Also, the term "the internal dtype" may not be appropriate here. What you printed is the dtype of the Index that holds the categories, not the dtype of their codes. In both examples, df['nombre'].cat.codes is the internal representation and its dtype is int8.
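As a quick check of that last point, the sketch below compares the codes for both conversions. It again uses a standalone Series with names of my own choosing, and assumes a small number of categories, which is why the codes fit in int8.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], name="nombre")

from_object = s.astype("object").astype("category")
from_str = s.astype("str").astype("category")

# The codes are small integers pointing into the categories;
# their dtype does not depend on what the categories hold.
print(from_object.cat.codes.dtype)  # int8
print(from_str.cat.codes.dtype)     # int8

# Each value is stored as the position of its category.
print(from_object.cat.codes.tolist())  # [0, 1, 2, 3]
print(from_str.cat.codes.tolist())     # [0, 1, 2, 3]
```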

Upvotes: 4
