Reputation: 1795
Context: Consider the following
import pandas as pd
X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})
together = pd.concat([X.assign(s='x'), Y.assign(s='y')])
After the final line, I would like the dtype of s to be
cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])
Of course, I can just do
together.s = together.s.astype(cat_type)
However, if X and Y are sufficiently large, this costs a large amount of memory for the intermediate string columns, and every time I do one of these 'joins' the column is converted from categories to strings and back.
Question: Is there a (clean) way to assign a single value from a category to a data frame column without paying the penalty of converting to strings and back?
Of course, the actual data I care about is quite large. The difference between categories and strings results in paging to disk.
Upvotes: 1
Views: 370
Reputation: 862681
I think you can convert to categorical before concat:
cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])
X = X.assign(s='x')
X.s = X.s.astype(cat_type)
Y = Y.assign(s='y')
Y.s = Y.s.astype(cat_type)
together = pd.concat([X, Y])
print(together.dtypes)
A       int64
s    category
dtype: object
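The conversion above only helps because both frames use the same cat_type. A minimal sketch of why that matters, using small throwaway Series (the names a, b, c, d are illustrative only): when every piece shares one CategoricalDtype, pd.concat keeps it, whereas pieces that each inferred their own categories lose the categorical dtype.

```python
import pandas as pd

shared = pd.api.types.CategoricalDtype(categories=["x", "y"])

# Both pieces carry the same dtype, so concat preserves it.
a = pd.Series(["x", "x"], dtype=shared)
b = pd.Series(["y", "y"], dtype=shared)
same = pd.concat([a, b], ignore_index=True)

# Each piece infers its own categories (['x'] vs ['y']),
# so the result is no longer categorical.
c = pd.Series(["x", "x"], dtype="category")
d = pd.Series(["y", "y"], dtype="category")
mixed = pd.concat([c, d], ignore_index=True)

print(same.dtype)
print(mixed.dtype)
```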
Another solution is to build the categorical column directly:
cat_type = pd.api.types.CategoricalDtype(categories=['x','y'])
together = pd.concat([
    X.assign(s=pd.Categorical(['x'] * len(X), dtype=cat_type)),
    Y.assign(s=pd.Categorical(['y'] * len(Y), dtype=cat_type)),
])
print(together.dtypes)
A       int64
s    category
dtype: object
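If even the temporary ['x'] * len(X) list is too costly, one possible variation is to skip strings entirely and build the column from integer codes with pd.Categorical.from_codes. This is only a sketch: constant_category is a hypothetical helper, not part of pandas, and it assumes the code you pass is a valid position in cat_type.categories (0 for 'x', 1 for 'y').

```python
import numpy as np
import pandas as pd

cat_type = pd.api.types.CategoricalDtype(categories=["x", "y"])

def constant_category(code, n, dtype):
    """Hypothetical helper: a Categorical of length n whose every
    entry is dtype.categories[code], built without any strings."""
    codes = np.full(n, code, dtype=np.int8)  # one byte per row
    return pd.Categorical.from_codes(codes, dtype=dtype)

X = pd.DataFrame({"A": [0, 1, 2, 3]})
Y = pd.DataFrame({"A": [5, 6, 7, 8]})

together = pd.concat([
    X.assign(s=constant_category(0, len(X), cat_type)),  # all 'x'
    Y.assign(s=constant_category(1, len(Y), cat_type)),  # all 'y'
])
print(together.dtypes)
```

Because both pieces share cat_type, the concat should keep s as category, and the only intermediate per frame is a small integer array.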
Upvotes: 1