Reputation: 749
I have a dataframe with the columns below:
u'wellthie_issuer_identifier', u'issuer_name', u'service_area_identifier', u'hios_plan_identifier', u'plan_year', u'type'
I need to validate the values in each column and end up with a dataframe that contains only valid rows.
For example, I need to check that the plan_year
column satisfies the validation below:
presence: true, numericality: true, length: { is: 4 }
and that the hios_plan_identifier
column satisfies the regex and length rules below:
format: /\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\z/,
presence: true, length: { minimum: 10 },
and that the type
column contains one of the following values:
in: ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan']
There are a lot of columns that I need to validate; I have tried to give example data.
I am able to check regex with str.contains('\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z', regex=True)
Similarly, I can check the other validations individually. I am confused about how to put all the validations together. Should I put them all in an if
condition joined with and?
Is there an easy way to validate the dataframe columns? Need help here.
Upvotes: 4
Views: 17442
Reputation: 1
You could use a general-purpose helper; follow the Python script below:
import pandas as pd
from typing import Optional, List, Tuple, Dict, Union

def validate_dataframe(df: pd.DataFrame,
                       n_cols: Optional[int] = None,
                       n_rows: Optional[Tuple[int, int]] = None,
                       columns: Optional[List[str]] = None,
                       column_types: Optional[Dict[str, type]] = None,
                       check_duplicates: Optional[bool] = False,
                       check_null_values: Optional[bool] = False,
                       unique_columns: Optional[List[str]] = None,
                       column_ranges: Optional[Dict[str, Tuple[Union[int, float], Union[int, float]]]] = None,
                       date_columns: Optional[List[str]] = None,
                       categorical_columns: Optional[Dict[str, List[Union[str, int, float]]]] = None
                       ) -> Tuple[bool, str]:
    """
    Validates a Pandas DataFrame based on specified criteria.

    Parameters:
    - df (pd.DataFrame): The DataFrame to be validated.
    - n_cols (int, optional): Number of expected columns in the DataFrame.
    - n_rows (tuple, optional): Tuple (min_rows, max_rows) specifying the expected range of rows.
    - columns (list, optional): List of column names that should be present in the DataFrame.
    - column_types (dict, optional): Dictionary mapping column names to the expected data types.
    - check_duplicates (bool, optional): Check for the presence of duplicate rows in the DataFrame.
    - check_null_values (bool, optional): Check for the presence of null values in the DataFrame.
    - unique_columns (list, optional): List of columns that should have only unique values.
    - column_ranges (dict, optional): Dictionary mapping numeric columns to the allowed ranges.
    - date_columns (list, optional): List of columns containing date values to validate the format.
    - categorical_columns (dict, optional): Dictionary mapping categorical columns to allowed values.

    Returns:
    - tuple: (bool, str) indicating success or failure, and an optional description of the problem.
    """
    # Validate number of columns
    if n_cols is not None and len(df.columns) != n_cols:
        return (False, f"Error: Expected {n_cols} columns, but found {len(df.columns)} columns.")
    # Validate row range
    if n_rows is not None:
        min_rows, max_rows = n_rows
        if not (min_rows <= len(df) <= max_rows):
            return (False, f"Error: Number of rows should be between {min_rows} and {max_rows}.")
    # Validate columns
    if columns is not None:
        if not set(columns).issubset(df.columns):
            missing_columns = set(columns) - set(df.columns)
            return (False, f"Error: Missing columns: {missing_columns}.")
    # Validate column types
    if column_types is not None:
        if set(column_types.keys()).issubset(df.columns):
            for col, expected_type in column_types.items():
                if not df[col].dtype == expected_type:
                    return (False, f"Error: Column '{col}' should have type {expected_type}.")
        else:
            return (False, "Error: Some columns in column_types are not present in the DataFrame.")
    # Validate duplicates
    if check_duplicates and df.duplicated().any():
        return (False, "Duplicates found in the DataFrame.")
    # Validate null values
    if check_null_values and df.isnull().any().any():
        return (False, "DataFrame contains null values.")
    # Validate unique values in specific columns
    if unique_columns is not None:
        if set(unique_columns).issubset(df.columns):
            if df[unique_columns].nunique().ne(df[unique_columns].count()).any():
                return (False, "Some columns should have only unique values.")
    # Validate values in a specific range
    if column_ranges is not None:
        for col, value_range in column_ranges.items():
            if col in df.columns and not df[col].between(*value_range).all():
                return (False, f"Values in '{col}' should be between {value_range[0]} and {value_range[1]}.")
    # Validate date format (assuming 'date_columns' are date columns)
    if date_columns is not None:
        for col in date_columns:
            if col in df.columns:
                try:
                    pd.to_datetime(df[col], errors='raise')
                except ValueError:
                    return (False, f"'{col}' should be in a valid date format.")
    # Validate categorical values
    if categorical_columns is not None:
        for col, allowed_values in categorical_columns.items():
            if col in df.columns and not df[col].isin(allowed_values).all():
                return (False, f"Values in '{col}' should be {allowed_values}.")
    # If all validations pass, return True
    return (True, "DataFrame has passed all validations.")
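For example, a hypothetical call for a couple of the question's columns could look like this (the sample data and argument choices are made up for illustration):
df = pd.DataFrame({
    'plan_year': [2019, 2020],
    'type': ['MetalPlan', 'ChipPlan'],
})
ok, message = validate_dataframe(
    df,
    columns=['plan_year', 'type'],
    check_null_values=True,
    column_ranges={'plan_year': (1000, 9999)},  # 4-digit years
    categorical_columns={'type': ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan',
                                  'DualPlan', 'MedicaidPlan', 'ChipPlan']},
)
print(ok, message)  # -> True DataFrame has passed all validations.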
Upvotes: -1
Reputation: 2012
There are multiple pandas functions you could use. The basic syntax to filter your dataframe by content is:
df = df[(condition1) & (condition2) & ...] # filter the df and assign to the same df
Specifically for your case, you could replace condition
with the following expressions (a combined example follows the list):
df[some_column] == some_value
df[some_column].isin(some_list_of_values) # This checks whether the value of the column is one of the values in the list
df[some_column].str.contains() # You can use it the same as str.contains()
df[some_column].str.isdigit() # Same usage as str.isdigit(); checks whether the string is all digits (make sure the column type is string first)
df[some_column].str.len() == 4 # Filter strings with a length of 4
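Putting these together for the validations in the question, a minimal sketch could look like this (the regex and allowed values come from the question; the example dataframe is hypothetical):
import pandas as pd

df = pd.DataFrame({  # hypothetical example data
    'plan_year': ['2019', '20x9'],
    'hios_plan_identifier': ['12345AB12345-01', 'bad-id'],
    'type': ['MetalPlan', 'UnknownPlan'],
})
hios_re = r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(-?\d{2})*)\Z'
plan_types = ['MetalPlan', 'MedicarePlan', 'BasicHealthPlan',
              'DualPlan', 'MedicaidPlan', 'ChipPlan']

df = df[df['plan_year'].notnull()                       # presence
        & df['plan_year'].str.isdigit()                 # numericality
        & (df['plan_year'].str.len() == 4)              # length is 4
        & df['hios_plan_identifier'].str.contains(hios_re, regex=True, na=False)
        & (df['hios_plan_identifier'].str.len() >= 10)  # minimum length 10
        & df['type'].isin(plan_types)].reset_index(drop=True)
Only the first row survives all of the conditions.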
Finally, if you want to reset the index, you could use df = df.reset_index(drop=True)
to reset your output df's index to 0, 1, 2, ...
Edit: To check for NaN, NaT, or None values, you could use
df[some_column].isnull()
For multiple columns, you could use
df[[col1, col2]].isin(valuelist).all(axis=1)
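For instance, a small self-contained sketch combining both checks (df, col1, col2, and valuelist are toy placeholders):
import pandas as pd

df = pd.DataFrame({'col1': ['a', None], 'col2': ['b', 'c']})  # toy data
valuelist = ['a', 'b', 'c']

# Keep rows with no nulls and only allowed values in both columns.
keep = df[['col1', 'col2']].notnull().all(axis=1) & \
       df[['col1', 'col2']].isin(valuelist).all(axis=1)
print(df[keep])  # only the first row survives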
Upvotes: 6