Reputation: 761
I have a dataset with a string column (name: 14) that I want to convert to a categorical feature. As far as I know, there are two ways to do that:
pd.Categorical(data[14])
data[14].astype('category')
While both of these produce a result with the same .dtype: CategoricalDtype(categories=[' <=50K', ' >50K'], ordered=False), they're not the same.
Calling .describe() on the results produces different outputs. The first one outputs information about the individual categories, while the second one (astype(..)) results in the typical describe output with count, unique, top, freq, and name, listing dtype: object.
My question is, then, why / how do they differ?
It's this dataset: http://archive.ics.uci.edu/ml/datasets/Adult
data = pd.read_csv("./adult/adult.data", header=None)
pd.Categorical(data[14]).describe()
data[14].astype('category').describe()
pd.Categorical(data[14]).dtype
data[14].astype('category').dtype
Upvotes: 4
Views: 3753
Reputation: 879601
As Bakuriu points out, type(pd.Categorical(data[14])) is Categorical, while type(data[14].astype('category')) is Series:
import pandas as pd
data = pd.read_csv("./adult/adult.data", header=None)
cat = pd.Categorical(data[14])
ser = data[14].astype('category')
print(type(cat))
# pandas.core.arrays.categorical.Categorical
print(type(ser))
# pandas.core.series.Series
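To make the relationship between the two objects concrete, here is a minimal sketch that converts one into the other. It uses a small made-up Series in place of the Adult dataset, and the names raw, unwrapped and wrapped are just illustrative:

```python
import pandas as pd

# Tiny stand-in for column 14 of the dataset (hypothetical values)
raw = pd.Series([" <=50K", " >50K", " <=50K"], name=14)

cat = pd.Categorical(raw)        # a Categorical array, no index or name
ser = raw.astype("category")     # a Series whose values are categorical

# The Series stores a Categorical internally; .array exposes it
unwrapped = ser.array

# Wrapping the Categorical in a Series gives the same kind of object as astype
wrapped = pd.Series(cat, name=14)

print(type(unwrapped).__name__)
print(type(wrapped).__name__)
print(wrapped.equals(ser))
```

So the two approaches hold the same data and the same dtype; the difference is only whether you end up with the bare array or with a Series around it.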
The behavior of describe() differs because Categorical.describe is defined differently than Series.describe.
Whenever you call Categorical.describe(), you'll get counts and freqs per category:
In [174]: cat.describe()
Out[174]:
            counts    freqs
categories
 <=50K       24720  0.75919
 >50K         7841  0.24081
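and whenever you call Series.describe() on a categorical Series, you'll get count, unique, top and freq. Note that count and freq have a different meaning here too: count is the number of non-null values in the Series and freq is the number of occurrences of the most common value (top), whereas Categorical.describe() reports a count and a relative frequency for each category: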
In [175]: ser.describe()
Out[175]:
count      32561
unique         2
top        <=50K
freq       24720
Name: 14, dtype: object
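If you have the Series and want the per-category numbers anyway, value_counts() gets you there without going back to pd.Categorical. The following is a rough sketch on made-up data (four hypothetical rows rather than the real dataset), showing how the two summaries relate:

```python
import pandas as pd

# Hypothetical sample standing in for column 14
raw = pd.Series([" <=50K", " >50K", " <=50K", " <=50K"], name=14)
ser = raw.astype("category")

# Per-category counts and relative frequencies, computed from the Series
counts = ser.value_counts()
freqs = ser.value_counts(normalize=True)

# The Series-level summary: count, unique, top, freq
summary = ser.describe()

print(counts)
print(freqs)
print(summary)
```

Here summary["freq"] is simply the largest entry of counts, and summary["count"] is the total number of non-null rows, which is why the two describe() outputs look unrelated at first glance even though they summarize the same data.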
Upvotes: 4