Naruto
Naruto

Reputation: 139

How to find matching values between the columns of two dataframes?

I have two DataFrames, named Child and Parent. Child acts like a metadata table, and I need to validate the Data of Parent dataframe.

Child:

Cust_id     Description   Detail
1           Good          Regular
34          Excellent     Normal
45          Bulk          Buyer

Parent:

Name    Cust_id order   date    Payment
xyz     1       ice     01-02-2019  online
abc     45      bread   01-02-2019  offline
mno     56      Butter  01-02-2019  offline
pqr     67      cookies 01-02-2019  online
rst     34      Rice    01-02-2019  online
ert     1       egg     01-02-2019  online

I need to validate Cust_id present in the Parent dataframe or not?

I am selecting on all record Cust_id of Parent table using distinct, and using a loop to check if all data present in the Parent table, are present in the Child table or not.

How can this be accomplished without iteration, using pandas methods?

Upvotes: 1

Views: 1430

Answers (1)

Trenton McKinney
Trenton McKinney

Reputation: 62403

  • Use pandas Boolean Indexing the determine if 'Cust_id' of parent, is in 'Cust_id' of 'child'.
  • Use .isin on a list of unique 'Cust_id' from 'child'.
    • Indexing with isin
    • child.Cust_id.unique() creates an array of all the unique values in 'Cust_id'
import pandas as pd

child = pd.DataFrame({'Cust_id': [1, 34, 45], 'Description': ['Good', 'Excellent', 'Bulk'], 'Detail': ['Regular', 'Normal', 'Buyer']})

parent = pd.DataFrame({'Name': ['xyz', 'abc', 'mno', 'pqr', 'rst', 'ert'], 'Cust_id': [1, 45, 56, 67, 34, 1], 'order': ['ice', 'bread', 'Butter', 'cookies', 'Rice', 'egg'],
                       'date': ['01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019', '01-02-2019'], 'Payment': ['online', 'offline', 'offline', 'online', 'online', 'online']})

# mask using isin
mask = parent.Cust_id.isin(child.Cust_id.unique())

# return only the data from parent, where parent Cust_id isin child Cust_id
parent[mask]

# add a column to the parent dataframe
parent['in_child'] = mask

# display(parent)
  Name  Cust_id    order        date  Payment  in_child
0  xyz        1      ice  01-02-2019   online      True
1  abc       45    bread  01-02-2019  offline      True
2  mno       56   Butter  01-02-2019  offline     False
3  pqr       67  cookies  01-02-2019   online     False
4  rst       34     Rice  01-02-2019   online      True
5  ert        1      egg  01-02-2019   online      True
  • pandas.DataFrame.merge can be used in various ways as well.
  • The following solution uses an 'outer' merge with indicator=True
    • The '_merge' column indicates which dataframe the 'Cust_id' is in.
      • 'left_only' is the parent dataframe.
  • .merge combines the information from both dataframes, and I'm not sure if that's the desired output.
merged = parent.merge(child, on='Cust_id', how='outer', indicator=True)

# display(merged)
  Name  Cust_id    order        date  Payment Description   Detail     _merge
0  xyz        1      ice  01-02-2019   online        Good  Regular       both
1  ert        1      egg  01-02-2019   online        Good  Regular       both
2  abc       45    bread  01-02-2019  offline        Bulk    Buyer       both
3  mno       56   Butter  01-02-2019  offline         NaN      NaN  left_only
4  pqr       67  cookies  01-02-2019   online         NaN      NaN  left_only
5  rst       34     Rice  01-02-2019   online   Excellent   Normal       both

Upvotes: 1

Related Questions