NotSoShabby

Reputation: 3728

Pandas checking if a column is category issue

I'm trying to loop over my columns and handle a column differently depending on whether its dtype is category or something else.

The following check works for a series whose dtype is category, but raises an error when the series has object dtype.

if series.dtype == 'category':
    # do something

This works on a category series, but if the dtype is object it throws:

Error:

Traceback (most recent call last):
  File "", line 382, in trace_task
    R = retval = fun(*args, **kwargs)
  File "", line 54, in run_data_template_task
    data_template.run(data_bundle, columns=columns)
  File "", line 531, in run
    self.to_parquet(data_bundle, columns=columns)
  File "", line 195, in to_parquet
    df = self.parse_df(df, columns=columns, overwrite_columns=overwrite_columns)
  File "", line 378, in parse_df
    df[col.name] = parse_series_with_nans(df[col.name], 'str')
  File "", line 369, in parse_series_with_nans
    if series.dtype == 'category':
TypeError: data type "category" not understood

On the other hand, using:

if series.dtype is 'category':
    # do something

returns False even when the dtype is a category (which makes sense, because it's obviously not the same object).

A reproducible example:

         df = pd.DataFrame({'category_column': ['a', 'b', 'c'], 'other_column': [1, 2, 3]})
         df['category_column'] = df['category_column'].astype('category')
         df['category_column'].dtype is 'category'
Out[46]: False
         df['category_column'].dtype == 'category'
Out[47]: True
         df['other_column'].dtype == 'category'
Traceback (most recent call last):
  File "", line 3296, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-48-c6cc61c458d0>", line 1, in <module>
    d['other_column'].dtype == 'category'
TypeError: data type "category" not understood 

Upvotes: 4

Views: 1378

Answers (2)

Serge Ballesta

Reputation: 149075

The dtype of a Series is a complex object, not a string, so comparing it to a string may or may not give the expected result. Look at the dtypes in your example:

>>> print(repr(df.category_column.dtype))
CategoricalDtype(categories=['a', 'b', 'c'], ordered=False)
>>> print(repr(df.other_column.dtype))
dtype('int64')

That is enough to show that they are not string values!

If you need to do simple comparisons, you should use their name attribute, which is indeed a string:

>>> df['category_column'].dtype.name == 'category'
True
>>> df['other_column'].dtype.name == 'category'
False

Upvotes: 3

user2314737

Reputation: 29387

df['category_column'].dtype is 'category'

is False because `is` tests object identity, and the dtype object and the string 'category' are not the same object.

On the other hand,

df['category_column'].dtype == 'category'

is True because

All instances of CategoricalDtype compare equal to the string 'category'.

(https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#equality-semantics)

See also Understanding Python's "is" operator
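To make the difference between `is` and `==` concrete, here is a minimal pure-Python sketch. `ToyDtype` is a hypothetical stand-in written for this illustration, not pandas' actual CategoricalDtype; it only mimics the "compares equal to the string 'category'" behaviour quoted above.

```python
class ToyDtype:
    """Hypothetical stand-in for a dtype object (not pandas' real class)."""

    name = 'category'

    def __eq__(self, other):
        # Equal to the string 'category' or to any other ToyDtype instance.
        if isinstance(other, str):
            return other == self.name
        return isinstance(other, ToyDtype)

    def __hash__(self):
        # Defining __eq__ disables the default hash, so restore one.
        return hash(self.name)


dtype = ToyDtype()
label = 'category'

equal = (dtype == label)      # calls ToyDtype.__eq__, which accepts the string
identical = (dtype is label)  # identity: two distinct objects in memory

print(equal, identical)
```

The equality result depends entirely on how the class defines `__eq__`, while `is` never consults it.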

Upvotes: 2
