Reputation: 349
I have two dataframes:
import pandas as pd

df_data = pd.DataFrame({'ID':   ['x', 'y', 'x', 'z'],
                        'col1': ['1', '',  '2', 'f'],
                        'col2': ['5', '5', '5', '5'],
                        'col3': ['',  '1', '3', '5']})
df_data_req = pd.DataFrame({'ID':   ['x', 'y', 'z'],
                            'col1': ['X', '',  'X'],
                            'col2': ['X', 'X', 'X'],
                            'col3': ['X', 'X', '']})
I would then like to validate all rows in df_data and check (based on the ID column) whether every column that is marked as required in df_data_req has a value.
What I expect is an additional column in df_data that indicates "MissingData" or "OK", depending on what is specified in df_data_req. The output for the sample data would then be:
ID | col1 | col2 | col3 | validation
---|------|------|------|------------
x  | 1    | 5    |      | MissingData
y  |      | 5    | 1    | OK
x  | 2    | 5    | 3    | OK
z  | f    | 5    | 5    | OK
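In code terms, the per-row rule would be something like this sketch (validate_row is just an illustrative name):

def validate_row(row, req_row):
    # A column counts as required when df_data_req marks it with "X" for this ID
    required = [c for c in ('col1', 'col2', 'col3') if req_row[c] == 'X']
    return 'OK' if all(row[c] != '' for c in required) else 'MissingData'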
Upvotes: 1
Views: 698
Reputation: 75080
Here is an approach with set_index and df.reindex_like, followed by a conditional check:
import numpy as np

# Treat empty strings as missing values
a = df_data.set_index("ID").replace("", np.nan)
b = df_data_req.set_index("ID").replace("", np.nan)
# After filling from the requirements, any remaining "X" marks a missing required value
c = a.fillna(b.reindex_like(a)).eq("X").any(axis=1)
df_data['validation'] = np.where(c, "MissingData", "OK")
print(df_data)
  ID col1 col2 col3   validation
0  x    1    5       MissingData
1  y         5    1           OK
2  x    2    5    3           OK
3  z    f    5    5           OK
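To see why the fillna step works, print the intermediate frame (a small sketch reusing a and b from above):

print(b.reindex_like(a))   # requirement rows duplicated to match a's non-unique ID index
   col1 col2 col3
ID
x     X    X    X
y   NaN    X    X
x     X    X    X
z     X    X  NaN

a.fillna(...) then writes "X" only into cells that are both empty in df_data and required in df_data_req, so .eq("X").any(axis=1) flags exactly the rows with a missing required value.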
Upvotes: 3