JohnnieL
JohnnieL

Reputation: 1231

Pandas series - how to validate each item is categorical

I am importing data which should be categorical from an externally sourced csv file into a pandas dataframe.

The first thing I want to do is to validate that the values are valid for the categorical type.

My strategy is to create an instance of CategoricalDtype and then using apply to test each value.

Question: The only way I can figure out is to test each value is in CategoricalDtype.categories.values but is there a "better" way? are there any methods I can use to achieve the same? I'm new to CategoricalDtype and it doesnt feel like this is the best way to be testing the data value.

# example of what I'm doing
import pandas as pd
from pandas.api.types import CategoricalDtype


df = pd.read_csv('data.csv')
cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)

df['data_is_valid']=df['data_field'].apply(lambda x: x in cat.categories.values)

Upvotes: 1

Views: 377

Answers (1)

jezrael
jezrael

Reputation: 863166

If need test if exist values from column data_field :

df['data_is_valid']=df['data_field'].isin(cat.categories)

If need test also categorical_dtype:

from pandas.api.types import is_categorical_dtype

df['data_is_valid']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])

Difference is possible see in data sample:

from pandas.api.types import CategoricalDtype
from pandas.api.types import is_categorical_dtype
            
df = pd.DataFrame({ "data_field": ["A", "B", "C", "D", 'E']})

cat = CategoricalDtype(categories=["A", "B", "C"], ordered=False)

#categories match but not Categorical
df['data_is_valid1']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid2']=df['data_field'].isin(cat.categories)

cat_type = CategoricalDtype(categories=["A", "B", "C", 'D', 'E'], ordered=True)

#created Categorical column
df['data_field'] = df['data_field'].astype(cat_type)

#categoriesand Categorical match
df['data_is_valid3']=df['data_field'].isin(cat.categories) & is_categorical_dtype(df['data_field'])
#categories match not tested Categorical
df['data_is_valid4']=df['data_field'].isin(cat.categories)

print (df)
  data_field  data_is_valid1  data_is_valid2  data_is_valid3  data_is_valid4
0          A           False            True            True            True
1          B           False            True            True            True
2          C           False            True            True            True
3          D           False           False           False           False
4          E           False           False           False           False

Upvotes: 1

Related Questions