Cramer

Reputation: 1795

Assign a category value to all rows in a pandas column

Context: Consider the following

import pandas as pd
X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})

together = pd.concat([X.assign(s='x'), Y.assign(s='y')])

After that final line, I would like the dtype of s to be

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])

Of course, I can just do

together.s = together.s.astype(cat_type)

However, if X and Y are sufficiently large, this costs a lot of memory for the intermediate string columns, and every time I do one of these 'joins' the column is converted from categories to strings and back.

Question: Is there a (clean) way to assign a single value from a category to a data frame column without paying the penalty of converting to strings and back?

Of course, the actual data I care about is quite large. The difference between categories and strings results in paging to disk.
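For a rough sense of the gap, a minimal sketch along these lines compares the two representations of the label column. The row count `n` and the frame `demo` are made up for illustration, and the exact numbers will depend on the pandas version and string backend.

```python
import numpy as np
import pandas as pd

n = 100_000  # hypothetical size, for illustration only
demo = pd.DataFrame({"A": np.arange(n)}).assign(s="x")

cat_type = pd.api.types.CategoricalDtype(categories=["x", "y"])

# Bytes used by the label column as strings vs. as a categorical.
as_strings = demo["s"].memory_usage(index=False, deep=True)
as_category = demo["s"].astype(cat_type).memory_usage(index=False, deep=True)

print(as_strings, as_category)
```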

Upvotes: 1

Views: 370

Answers (1)

jezrael

Reputation: 862681

I think you can convert to categorical before concat:

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])

X = X.assign(s='x')
X.s = X.s.astype(cat_type)

Y = Y.assign(s='y')
Y.s = Y.s.astype(cat_type)

together = pd.concat([X, Y])
print(together.dtypes)

A       int64
s    category
dtype: object
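Giving both frames the same cat_type is what keeps the result categorical. As a small sketch of the failure mode (the name `mismatched` is only for illustration), letting each frame infer its own one-category dtype makes concat fall back to a non-categorical column:

```python
import pandas as pd

X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})

# Each frame infers its own one-category dtype, so the two dtypes differ.
mismatched = pd.concat([
    X.assign(s=pd.Categorical(["x"] * len(X))),
    Y.assign(s=pd.Categorical(["y"] * len(Y))),
])

# Check whether the concatenated column is still categorical.
stays_categorical = isinstance(
    mismatched["s"].dtype, pd.api.types.CategoricalDtype
)
print(stays_categorical)
```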

Another solution is to build the categorical column directly inside assign:

cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])
together = pd.concat([X.assign(s=pd.Categorical(['x'] * len(X), dtype=cat_type)),
                      Y.assign(s=pd.Categorical(['y'] * len(Y), dtype=cat_type))])

print(together.dtypes)

A       int64
s    category
dtype: object
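If even the temporary list of strings is too much, one possible variation is to build the column from integer codes with pd.Categorical.from_codes, so no string per row is created at all. This is only a sketch: `constant_category` is a hypothetical helper name, and the int8 codes assume the dtype has fewer than 128 categories.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})
cat_type = pd.api.types.CategoricalDtype(categories=["x", "y"])


def constant_category(n, value, dtype):
    """Hypothetical helper: n rows that all hold `value`, built from codes."""
    code = dtype.categories.get_loc(value)  # position of value in the categories
    codes = np.full(n, code, dtype="int8")  # one small integer per row
    return pd.Categorical.from_codes(codes, dtype=dtype)


together = pd.concat([
    X.assign(s=constant_category(len(X), "x", cat_type)),
    Y.assign(s=constant_category(len(Y), "y", cat_type)),
])

print(together.dtypes)
```

Because both columns share cat_type, concat keeps the result categorical, just as in the two versions above.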

Upvotes: 1
