Reputation: 113
I am new to python and hence will appreciate any help on this!
Suppose i have a bunch of columns in a dataset with categorical values. Let's say Gender, marital status, etc.
While doing input validation of the dataset, i need to check if the values of columns are within an acceptable range.
For instance, if the column is gender, accpetable values as male, female. Suppose column is marital status, acceptable values are single, married, divorced.
if for instance, the user inputs a dataset with values for these variables outside the acceptable range, i need to write a function to point it out.
How do i do this?
suppose i create a static acceptable value mapping list like below, for all datasets:
dataset variable acceptable_values
demographics gender male,female
demographics marital status single,married,divorced
purchase region south,east,west,north
Ideally code should go through all variables in all datasets listed in above mapping file and see if the values are in the "acceptable_values" list
Suppose below are new datasets, the code show throw an output saying,
unacceptable values found for dataset: demographics, for variable: gender - Boy,Other,missing,(blank) unacceptable values found for dataset: demographics, for variable: maritalstatus- separated
demographics:
id gender maritalstatus
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
Let me know how this can be achieved. Looks fairly complicated to my understanding
It would be great if the code could convert the "new"/"unacceptable" values to NaN or 0 or something like that as well, but this is good to have.
Upvotes: 0
Views: 216
Reputation: 18211
You could do something like the following, where we assume that you're storing your data frames in a dictionary called df_dict
, and the collection of accepted values in a data frame called df_accepted
:
# First, use the dataset and variable name as indices in df_accepted
# to make it easier to perform lookups
df_accepted.set_index(['dataset', 'variable'], inplace=True)
# Loop over all data frames
for name, df in df_dict.items():
# Loop over all columns in the current data frame
for c in df:
# Find the indices for the given column for which the values
# does /not/ belong to the list of accepted values for this column.
try:
mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
# Print the values that did not belong to the list
print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
# Convert them into NaNs
df[c][mask] = np.nan
except KeyError:
print(f'Skipping validation of {c} in {name}')
With your given input:
In [200]: df_accepted
Out[200]:
dataset variable acceptable_values
0 demographics gender male,female
1 demographics maritalstatus single,married,divorced
2 purchase region south,east,west,north
In [201]: df_dict['demographics']
Out[201]:
gender maritalstatus
id
1 male single
2 male single
3 Boy single
4 Other married
5 missing divorced
6 (blank) separated
In [202]: df_dict['purchase']
Out[202]:
region count
0 south 60
1 west 90210
2 north-east 10
In [203]: df_accepted.set_index(['dataset', 'variable'], inplace=True)
...:
...: for name, df in df_dict.items():
...: for c in df:
...: try:
...: mask = ~df[c].isin(df_accepted.loc[name, c].acceptable_values.split(','))
...: print(f'Bad values for {c} in {name}: {", ".join(df[c][mask])}')
...: df[c][mask] = np.nan
...: except KeyError:
...: print(f'Skipping validation of {c} in {name}')
...:
Bad values for gender in demographics: Boy, Other, missing, (blank)
Bad values for maritalstatus in demographics: separated
Bad values for region in purchase: north-east
Skipping validation of count in purchase
In [204]: df_accepted
Out[204]:
acceptable_values
dataset variable
demographics gender male,female
maritalstatus single,married,divorced
purchase region south,east,west,north
In [205]: df_dict['demographics']
Out[205]:
gender maritalstatus
id
1 male single
2 male single
3 NaN single
4 NaN married
5 NaN divorced
6 NaN NaN
In [206]: df_dict['purchase']
Out[206]:
region count
0 south 60
1 west 90210
2 NaN 10
Upvotes: 1
Reputation: 2905
There might be a simpler way to do this, but this solution works:
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['region', 'number'], data=[['north',0],['south',-4],['hello',15]])
valid_values = {'region': {'north','south','west','east'}}
df = df.apply(lambda column:
column.apply(lambda x: x if x in valid_values[column.name] else np.nan)
if column.name in valid_values else column)
Upvotes: 0