Reputation: 4815

Categorical Variables In A Pandas Dataframe?

I am working my way through Wes's Python For Data Analysis, and I've run into a strange problem that is not addressed in the book.

In the code below, based on page 199 of his book, I create a dataframe and then use pd.cut() to create cat_obj. According to the book, cat_obj is

"a special Categorical object. You can treat it like an array of strings indicating the bin name; internally it contains a levels array indicating the distinct category names along with a labeling for the ages data in the labels attribute"

Awesome! However, if I use the exact same pd.cut() code (In [5] below) to create a new column of the dataframe (called df['cat']), that column is not treated as a special categorical variable but simply as a regular pandas series.

How, then, do I create a column in a dataframe that is treated as a categorical variable?

In [4]:

import pandas as pd

raw_data = {'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'score': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['name', 'score'])

bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']

In [5]:

cat_obj = pd.cut(df['score'], bins, labels=group_names)
df['cat'] = pd.cut(df['score'], bins, labels=group_names)

In [7]:

type(cat_obj)

Out[7]:
pandas.core.categorical.Categorical

In [8]:

type(df['cat'])

Out[8]:
pandas.core.series.Series

Upvotes: 16

Answers (3)

undershock

Reputation: 803

According to the pandas documentation on categorical data (http://pandas-docs.github.io/pandas-docs-travis/categorical.html), from pandas 0.15 onwards you have two options.

Specify dtype="category" when constructing a Series:

In [1]: s = pd.Series(["a","b","c","a"], dtype="category")

In [2]: s
Out[2]: 
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]

You can then add this Series to an existing DataFrame as a column.

Or convert an existing Series or column to a category dtype:

In [3]: df = pd.DataFrame({"A":["a","b","c","a"]})

In [4]: df["B"] = df["A"].astype('category')

In [5]: df
Out[5]: 
   A  B
0  a  a
1  b  b
2  c  c
3  a  a
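
Applied to the question's setup, a minimal sketch might look like the following. It assumes pandas 0.15 or later and uses a shortened version of the question's score data. The column is still a Series, but its dtype is category, and the categorical features are reached through the .cat accessor.

```python
import pandas as pd

# Shortened version of the question's data (assumption: pandas >= 0.15)
df = pd.DataFrame({"score": [25, 94, 57, 62, 70]})
bins = [0, 25, 50, 75, 100]
group_names = ["Low", "Okay", "Good", "Great"]

# Assigning the result of pd.cut keeps the categorical dtype on the column
df["cat"] = pd.cut(df["score"], bins, labels=group_names)

print(type(df["cat"]))                 # the column is still a pandas Series
print(df["cat"].dtype)                 # category
print(list(df["cat"].cat.categories))  # ['Low', 'Okay', 'Good', 'Great']
print(df["cat"].cat.codes.tolist())    # [0, 3, 2, 2, 2]
```

So type(df['cat']) reporting Series is expected; the thing to check is df['cat'].dtype.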

Upvotes: 0

xrage

Reputation: 4800

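This might be happening because of getter/setter behaviour like the following, where the value you read back does not have the type you assigned.

Sample getter and setter:

class a:
    x = 1

    @property
    def p(self):
        # The getter converts whatever was stored to int
        return int(self.x)

    @p.setter
    def p(self, v):
        # The setter stores the value unchanged
        self.x = v

t = 1.32
obj = a()    # use one instance so the assignment is not lost
obj.p = t

print(type(t))      # <class 'float'>
print(type(obj.p))  # <class 'int'>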

For now, a DataFrame only accepts Series data, and its setter converts Categorical data into a Series. Categorical support in DataFrames is due in the next pandas release.

Upvotes: 1

jmxp

Reputation: 1

Right now, you can't have categorical data in a Series or DataFrame object, but this functionality will be implemented in Pandas 0.15 (due in September).

Upvotes: 0
