Reputation: 619
I have 2 dataframes, one of them of employee information by Country, and another one with a mapping of possible values for LOV columns per country (depending on the country, the column may or may not be an LOV and accept different values and is case sensitive). Example I have this data belonging to CZ:
Country Employee Name Contract Type Grade Education
CZ Jonathan permanent 1 Male
CZ Peter np.nan 6 male
CZ Maro Fixed . N/A
CZ Marisa Contractor 01-01-2020 Female
CZ Petra Permanent 5 Female
And this would be the mapping dataframe:
Country LOV Column Values
CZ Contract Type Permanent
CZ Contract Type Fixed
US Contract Type Permanent
US Contract Type Fixed
US Contract Type Contractor
AE Contract Type Permanent
CZ Grade 1
CZ Grade 2
CZ Grade 3
CZ Grade 4
CZ Grade 5
US Contract Type Manager
US Contract Type Non Manager
CZ Education 1st Degree
CZ Education 2nd Degree
SK Education A
SK Education B
SK Education C
AE Gender Male
AE Gender Female
I need a way to filter the LOV mapping dataframe for the Country I am analyzing, their LOV columns and possible list of values and then checking against the data which values they provided on those columns that are non compliant to the mapping table (in a list, and if an empty list this would mean all values complied to the mapping) and identify the rows the error occured (also in a list, and again if an empty list there is no error). If the cell is blank, do not consider it as an error.
Example:
Contract Type
Values non compliant in the column: ['permanent','Contractor']
Rows error occured: [0,3]
Education
Values non compliant in the column: ['Male', 'male', 'Female', 'Female']
Rows error occured: [0, 1, 3, 4]
Grade
Values non compliant in the column: ['6', '.', '01-01-2020']
Rows error occured: [1, 2, 3]
Thank you for the support!
Upvotes: 1
Views: 39
Reputation: 195613
You can try:
df = df.replace('np.nan', np.nan)
df_mapping = df_mapping.groupby(['Country', 'LOV Column'])['Values'].agg(set)
out = {}
for idx, row in df.iterrows():
country = row['Country']
for i, v in zip(row.index, row):
if pd.notna(v) and (country, i) in df_mapping and v not in df_mapping[(country, i)]:
out.setdefault(i, []).append((idx, v))
for k, v in out.items():
print(k)
print('Values non compliant in the column:', [i for _, i in v])
print('Rows error occured:', [i for i, _ in v])
print()
Prints:
Contract Type
Values non compliant in the column: ['permanent', 'Contractor']
Rows error occured: [0, 3]
Education
Values non compliant in the column: ['Male', 'male', 'Female', 'Female']
Rows error occured: [0, 1, 3, 4]
Grade
Values non compliant in the column: ['6', '.', '01-01-2020']
Rows error occured: [1, 2, 3]
Upvotes: 1