kaos1511
kaos1511

Reputation: 113

Function to get acceptable values for a variable pandas

I am new to python and hence will appreciate any help on this!

Suppose i have a bunch of columns in a dataset with categorical values. Let's say Gender, marital status, etc.

While doing input validation of the dataset, i need to check if the values of columns are within an acceptable range.

For instance, if the column is gender, accpetable values as male, female. Suppose column is marital status, acceptable values are single, married, divorced.

if for instance, the user inputs a dataset with values for these variables outside the acceptable range, i need to write a function to point it out.

How do i do this?

suppose i create a static acceptable value mapping list like below, for all datasets:

dataset        variable       acceptable_values
demographics   gender         male,female
demographics   marital status single,married,divorced
purchase       region         south,east,west,north       

Ideally code should go through all variables in all datasets listed in above mapping file and see if the values are in the "acceptable_values" list

Suppose below are new datasets, the code show throw an output saying,

unacceptable values found for dataset: demographics, for variable: gender - Boy,Other,missing,(blank) unacceptable values found for dataset: demographics, for variable: maritalstatus- separated

demographics:
id  gender  maritalstatus
1   male    single
2   male    single
3   Boy     single
4   Other   married
5   missing divorced
6   (blank) separated

Let me know how this can be achieved. Looks fairly complicated to my understanding

It would be great if the code could convert the "new"/"unacceptable" values to NaN or 0 or something like that as well, but this is good to have.

Upvotes: 0

Views: 216

Answers (2)

fuglede
fuglede

Reputation: 18211

You could do something like the following, where we assume that you're storing your data frames in a dictionary called df_dict, and the collection of accepted values in a data frame called df_accepted:

# First, use the dataset and variable name as indices in df_accepted
# to make it easier to perform lookups
df_accepted.set_index(['dataset', 'variable'], inplace=True)

# Loop over all data frames
for name, df in df_dict.items():
    # Loop over all columns in the current data frame
    for c in df:
        # Find the indices for the given column for which the values
        # does /not/ belong to the list of accepted values for this column.
        try:
            mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
            # Print the values that did not belong to the list
            print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
            # Convert them into NaNs
            df[c][mask] = np.nan
        except KeyError:
            print(f'Skipping validation of {c} in {name}')

With your given input:

In [200]: df_accepted
Out[200]:
        dataset       variable        acceptable_values
0  demographics         gender              male,female
1  demographics  maritalstatus  single,married,divorced
2      purchase         region    south,east,west,north

In [201]: df_dict['demographics']
Out[201]:
     gender maritalstatus
id
1      male        single
2      male        single
3       Boy        single
4     Other       married
5   missing      divorced
6   (blank)     separated

In [202]: df_dict['purchase']
Out[202]:
       region  count
0       south     60
1        west  90210
2  north-east     10

In [203]: df_accepted.set_index(['dataset', 'variable'], inplace=True)
     ...:
     ...: for name, df in df_dict.items():
     ...:     for c in df:
     ...:         try:
     ...:             mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
     ...:             print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
     ...:             df[c][mask] = np.nan
     ...:         except KeyError:
     ...:             print(f'Skipping validation of {c} in {name}')
     ...:
Bad values for gender in demographics: Boy, Other, missing, (blank)
Bad values for maritalstatus in demographics: separated
Bad values for region in purchase: north-east
Skipping validation of count in purchase

In [204]: df_accepted
Out[204]:
                                  acceptable_values
dataset      variable
demographics gender                     male,female
             maritalstatus  single,married,divorced
purchase     region           south,east,west,north

In [205]: df_dict['demographics']
Out[205]:
   gender maritalstatus
id
1    male        single
2    male        single
3     NaN        single
4     NaN       married
5     NaN      divorced
6     NaN           NaN

In [206]: df_dict['purchase']
Out[206]:
  region  count
0  south     60
1   west  90210
2    NaN     10

Upvotes: 1

Itamar Mushkin
Itamar Mushkin

Reputation: 2905

There might be a simpler way to do this, but this solution works:

import pandas as pd
import numpy as np

df = pd.DataFrame(columns=['region', 'number'], data=[['north',0],['south',-4],['hello',15]])

valid_values = {'region': {'north','south','west','east'}}

df = df.apply(lambda column: 
         column.apply(lambda x: x if x in valid_values[column.name] else np.nan) 
         if column.name in valid_values else column)

Upvotes: 0

Related Questions