Reputation: 349
I have two dataframes:
import pandas as pd

df_data = pd.DataFrame({'ID':   ['x', 'y', 'x', 'z'],
                        'col1': ['1', '',  '2', 'f'],
                        'col2': ['5', '5', '5', '5'],
                        'col3': ['',  '1', '3', '5']})
df_data_req = pd.DataFrame({'ID':   ['x', 'y', 'z'],
                            'col1': ['X', '',  'X'],
                            'col2': ['X', 'X', 'X'],
                            'col3': ['X', 'X', '']})
I would then like to validate all rows in df_data and check (based on the ID column) whether every column that is marked as required in df_data_req has a value.
What I expect is an additional column in df_data that indicates "MissingData" or "OK", depending on what is specified in df_data_req. The output for the sample data would then be:
ID | col1 | col2 | col3 | validation
---|------|------|------|------------
x  | 1    | 5    |      | MissingData
y  |      | 5    | 1    | OK
x  | 2    | 5    | 3    | OK
z  | f    | 5    | 5    | OK
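In code terms, the per-row rule would be something like this sketch (validate_row is just an illustrative name):

def validate_row(row, req_row):
    # A column counts as required when df_data_req marks it with "X" for this ID
    required = [c for c in ('col1', 'col2', 'col3') if req_row[c] == 'X']
    return 'OK' if all(row[c] != '' for c in required) else 'MissingData'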
Upvotes: 1
Views: 698
Reputation: 75080
Here is an approach with set_index and df.reindex_like, followed by a conditional check:
import numpy as np

# Treat empty strings as missing values
a = df_data.set_index("ID").replace("", np.nan)
b = df_data_req.set_index("ID").replace("", np.nan)
# After filling from the requirements, any remaining "X" marks a missing required value
c = a.fillna(b.reindex_like(a)).eq("X").any(axis=1)
df_data['validation'] = np.where(c, "MissingData", "OK")
print(df_data)
  ID col1 col2 col3   validation
0  x    1    5       MissingData
1  y         5    1           OK
2  x    2    5    3           OK
3  z    f    5    5           OK
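To see why the fillna step works, print the intermediate frame (a small sketch reusing a and b from above):

print(b.reindex_like(a))   # requirement rows duplicated to match a's non-unique ID index
   col1 col2 col3
ID
x     X    X    X
y   NaN    X    X
x     X    X    X
z     X    X  NaN

a.fillna(...) then writes "X" only into cells that are both empty in df_data and required in df_data_req, so .eq("X").any(axis=1) flags exactly the rows with a missing required value.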
Upvotes: 3