Reputation: 139
I have two DataFrames, named Child
and Parent
. Child
acts like a metadata table, and I need to validate the Data of Parent
dataframe.
Child:
Cust_id Description Detail
1 Good Regular
34 Excellent Normal
45 Bulk Buyer
Parent:
Name Cust_id order date Payment
xyz 1 ice 01-02-2019 online
abc 45 bread 01-02-2019 offline
mno 56 Butter 01-02-2019 offline
pqr 67 cookies 01-02-2019 online
rst 34 Rice 01-02-2019 online
ert 1 egg 01-02-2019 online
I need to validate Cust_id
present in the Parent
dataframe or not?
I am selecting on all record Cust_id
of Parent
table using distinct, and using a loop to check if all data present in the Parent
table, are present in the Child
table or not.
How can this be accomplished without iteration, using pandas methods?
Upvotes: 1
Views: 1430
Reputation: 62403
'Cust_id'
of parent
, is in 'Cust_id'
of 'child'
..isin
on a list of unique 'Cust_id'
from 'child'.
child.Cust_id.unique()
creates an array of all the unique values in 'Cust_id'
import pandas as pd
child = pd.DataFrame({'Cust_id': [1, 34, 45], 'Description': ['Good', 'Excellent', 'Bulk'], 'Detail': ['Regular', 'Normal', 'Buyer']})
parent = pd.DataFrame({'Name': ['xyz', 'abc', 'mno', 'pqr', 'rst', 'ert'], 'Cust_id': [1, 45, 56, 67, 34, 1], 'order': ['ice', 'bread', 'Butter', 'cookies', 'Rice', 'egg'],
'date': ['01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019'], 'Payment': ['online', 'offline', 'offline', 'online', 'online', 'online']})
# mask using isin
mask = parent.Cust_id.isin(child.Cust_id.unique())
# return only the data from parent, where parent Cust_id isin child Cust_id
parent[mask]
# add a column to the parent dataframe
parent['in_child'] = mask
# display(parent)
Name Cust_id order date Payment in_child
0 xyz 1 ice 01-02-2019 online True
1 abc 45 bread 01-02-2019 offline True
2 mno 56 Butter 01-02-2019 offline False
3 pqr 67 cookies 01-02-2019 online False
4 rst 34 Rice 01-02-2019 online True
5 ert 1 egg 01-02-2019 online True
pandas.DataFrame.merge
can be used in various ways as well.'outer'
merge with indicator=True
'_merge'
column indicates which dataframe the 'Cust_id'
is in.
'left_only'
is the parent
dataframe..merge
combines the information from both dataframes, and I'm not sure if that's the desired output.merged = parent.merge(child, on='Cust_id', how='outer', indicator=True)
# display(merged)
Name Cust_id order date Payment Description Detail _merge
0 xyz 1 ice 01-02-2019 online Good Regular both
1 ert 1 egg 01-02-2019 online Good Regular both
2 abc 45 bread 01-02-2019 offline Bulk Buyer both
3 mno 56 Butter 01-02-2019 offline NaN NaN left_only
4 pqr 67 cookies 01-02-2019 online NaN NaN left_only
5 rst 34 Rice 01-02-2019 online Excellent Normal both
Upvotes: 1