Reputation: 109
Say I have a large dask dataframe of fruit. I have thousands of rows but only about 30 unique fruit names, so I make that column a category:
df['fruit_name'] = df.fruit_name.astype('category')
Now that this is a category, can I no longer filter it? For instance,
df_kiwi = df[df['fruit_name'] == 'kiwi']
raises TypeError("invalid type comparison").
If I try to create a "dummy" dataframe and merge against that, I get a ValueError: "You are trying to merge on int8 and category columns..."
df_dummy = pd.DataFrame(data={'fruit_name': 'kiwi'}, index=range(1))
df_dummy['fruit_name'] = df_dummy.fruit_name.astype('category')
df_new = df.merge(df_dummy, how="inner", on="fruit_name")
Do I lose certain merge and filter functionality on a categorical column, or am I just doing this wrong? (I am still extremely new to dask and pandas.) Thanks!
Upvotes: 1
Views: 12945
Reputation: 57271
Here is an example showing this working well:
In [1]: import dask
In [2]: df = dask.datasets.timeseries()
In [3]: df.head()
Out[3]:
                       id      name          x          y
timestamp
2000-01-01 00:00:00   978    Hannah   0.194721   0.518782
2000-01-01 00:00:01   973   Michael  -0.894162  -0.454409
2000-01-01 00:00:02  1043       Bob   0.829046  -0.585921
2000-01-01 00:00:03  1027     Edith  -0.109735   0.563914
2000-01-01 00:00:04   970  Patricia  -0.621248  -0.655324
In [4]: df['name'] = df.name.astype('category')
In [5]: df[df.name == 'Alice'].head()
Out[5]:
                       id   name          x          y
timestamp
2000-01-01 00:00:23   997  Alice  -0.662165  -0.260169
2000-01-01 00:00:58  1012  Alice  -0.840159  -0.036770
2000-01-01 00:01:23   961  Alice   0.831663   0.022570
2000-01-01 00:01:27   987  Alice  -0.874289  -0.358708
2000-01-01 00:02:09   984  Alice   0.445238  -0.658470
I recommend constructing a minimal failing example that reproduces the error.
Upvotes: 3