jvels

Reputation: 349

Pandas check if required columns have data

I have two dataframes:

  1. df_data, which contains an ID column linking it to another dataframe (df_data_req); the other columns contain some data. Not all columns will have data, which is OK in some cases.
  2. df_data_req contains the same IDs as df_data, and specifies which columns are required to have data.
import pandas as pd

df_data = pd.DataFrame({'ID': ['x', 'y', 'x', 'z'],
                        'col1': ['1', '', '2', 'f'],
                        'col2': ['5', '5', '5', '5'],
                        'col3': ['', '1', '3', '5']})
df_data_req = pd.DataFrame({'ID': ['x', 'y', 'z'],
                            'col1': ['X', '', 'X'],
                            'col2': ['X', 'X', 'X'],
                            'col3': ['X', 'X', '']})

I would then like to validate all rows in df_data and check (based on the ID column) whether all columns that are marked as required in df_data_req have a value.

What I expect is an additional column on df_data that indicates "MissingData" or "OK", depending on what is specified in df_data_req. The output for the sample data would then be:

ID col1 col2 col3 validation
x  1    5         MissingData
y       5    1    OK
x  2    5    3    OK
z  f    5    5    OK

Upvotes: 1

Views: 698

Answers (1)

anky

Reputation: 75080

Here is an approach with set_index and DataFrame.reindex_like, followed by a conditional check:

import numpy as np

# Turn empty strings into NaN so fillna can pull in the requirement markers
a = df_data.set_index("ID").replace("", np.nan)
b = df_data_req.set_index("ID").replace("", np.nan)

# Fill each missing cell with its requirement marker; a surviving "X"
# means a required column has no data in that row
c = a.fillna(b.reindex_like(a)).eq("X").any(axis=1)
df_data['validation'] = np.where(c, "MissingData", "OK")

print(df_data)

  ID col1 col2 col3   validation
0  x    1    5       MissingData
1  y         5    1           OK
2  x    2    5    3           OK
3  z    f    5    5           OK
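For comparison, here is an equivalent sketch of my own (not part of the original answer) that joins the requirements onto the data with merge instead of reindex_like, assuming each ID in df_data appears exactly once in df_data_req:

```python
import numpy as np
import pandas as pd

df_data = pd.DataFrame({'ID': ['x', 'y', 'x', 'z'],
                        'col1': ['1', '', '2', 'f'],
                        'col2': ['5', '5', '5', '5'],
                        'col3': ['', '1', '3', '5']})
df_data_req = pd.DataFrame({'ID': ['x', 'y', 'z'],
                            'col1': ['X', '', 'X'],
                            'col2': ['X', 'X', 'X'],
                            'col3': ['X', 'X', '']})

# Bring each row's requirements alongside its data; an inner merge
# preserves the order of the left (df_data) keys
merged = df_data.merge(df_data_req, on='ID', suffixes=('', '_req'))

# A row is invalid if any column is empty while its requirement is "X"
cols = ['col1', 'col2', 'col3']
missing = np.zeros(len(merged), dtype=bool)
for col in cols:
    missing |= (merged[col] == '') & (merged[col + '_req'] == 'X')

df_data['validation'] = np.where(missing, 'MissingData', 'OK')
print(df_data)
```

This avoids the duplicate-index alignment that fillna/reindex_like relies on, at the cost of carrying the extra `*_req` columns in the intermediate frame.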

Upvotes: 3
