Reputation: 307
I'm new to use pandas in python whereas I have good knowledge in working with python.
I've two data frames from which I've to get matching records and non matching records into new data frames.
Example :
DF1 :
ID Name Number DOB Salary
1 AAA 1234 12-05-1996 100000
2 BBB 1235 16-08-1997 200000
3 CCC 1236 24-04-1998 389999
4 DDD 1237 05-09-2000 450000
DF2 :
ID Name Number DOB Salary
1 AAA 1234 12-05-1996 100000
2 BBB 1235 16-08-1997 200000
3 CCC 1236 24-04-1998 389999
4 DDD 1237 05-09-2000 540000
And, with primary keys being ID & Name here(in reality the number of keys might vary), I need to get
Match_df :
ID Name Number DOB Salary
1 AAA 1234 12-05-1996 100000
2 BBB 1235 16-08-1997 200000
3 CCC 1236 24-04-1998 389999
Mismatch_df :
ID Name Number DOB Salary
4 DDD 1237 05-09-2000 540000
I've tried all possible ways like
pd.merge(df1, df2, left_on=[ID,Name],right_on=[ID,Name], how='inner')
and this produces all the unique keys that are in both the data frames. But this also produces non matching records.
But I'm getting this as my result :
ID Name Number DOB Salary
1 AAA 1234 12-05-1996 100000
2 BBB 1235 16-08-1997 200000
3 CCC 1236 24-04-1998 389999
4 DDD 1237 05-09-2000 540000
where 4th record is also getting included.
Here, only salary col is varying but in real Time, it may be a list of cols to be compared.
From this, I've to get only matching records to the matched_df and non matching records to the mismatch_df.
Kindly help me out in doing this.
Note: My dataset might be a massive one (100 million records in both datasets) so, please get me an effective approach reducing the time of execution.
Thanks in advance.
Upvotes: 6
Views: 25090
Reputation: 914
# pick index keys and compare column(s)
keys = ['ID', 'Name']
# if comparing all columns:
col_list = [col for col in df1.columns if col not in keys]
# # if comparing specific columns:
# col_list = ['Salary', 'DOB']
# extend keys with col_list for next step
sel_cols = keys.copy()
sel_cols.extend(col_list)
# set a multi-index with keys
# to dataframes with col_list columns
dfa = df1[sel_cols].set_index(keys)
dfb = df2[sel_cols].set_index(keys)
# make an equivalency boolean mask
dfa.update(dfb)
mask = np.equal(df1[col_list].values, dfa.values).all(axis=1)
# slice df1 with mask
Match_df = df1[mask]
Mismatch_df = df1[~mask]
Upvotes: 5
Reputation: 8826
Simplistic Answer to your question is with df1.where
:
Note: The resulting cells with NaN do not satisfy the conditions, i.e. they are not equal in the two dataframes. The ones that have a real value are the ones that are equal in the two dataframes
>>> df1.where(df1.Salary==df2.Salary)
DoB ID Name Salary
0 12-05-1996 1 AAA 100000.0
1 16-08-1997 2 BBB 200000.0
2 24-04-1998 3 CCC 389999.0
3 NaN NaN NaN NaN
While going with pd.merge
: If you Just want to merge the df1 & df1 without Column or index level then it will take defaults to the intersection of the columns in both DataFrames.
>>> pd.merge(df1, df2)
DoB ID Name Salary
0 12-05-1996 1 AAA 100000
1 16-08-1997 2 BBB 200000
2 24-04-1998 3 CCC 389999
If you wish to Join Column or index level then use on
.
>>> pd.merge(df1, df2, on="Salary")
DoB_x ID_x Name_x Salary DoB_y ID_y Name_y
0 12-05-1996 1 AAA 100000 12-05-1996 1 AAA
1 16-08-1997 2 BBB 200000 16-08-1997 2 BBB
2 24-04-1998 3 CCC 389999 24-04-1998 3 CCC
For mismatch in df2
: you can opt isin(dict) method:
>>> df2[~df2.isin(df1.to_dict('l')).all(1)]
DoB ID Name Salary
3 05-09-2000 4 DDD 540000
another way as Mabel given.
df2[~df2.isin(df1).all(axis=1)]
Upvotes: 7
Reputation: 2598
To get the match:
>> df1.merge(df2)
ID Name Number DOB Salary
0 1 AAA 1234 12-05-1996 100000
1 2 BBB 1235 16-08-1997 200000
2 3 CCC 1236 24-04-1998 389999
And for the mismatch selecting the row in df2
:
>> df2[~df2.isin(df1).all(axis=1)]
Name Number DOB Salary
ID
4 DDD 1237 05-09-2000 540000
Upvotes: 0
Reputation: 1145
My solution would be a bit different and would involve simply copying the salary from the other dataset over.
Such as:
DF1["Salary2"] = DF2["Salary"]
MatchDF = DF1[DF1["Salary"] == DF1["Salary2"]]
MisMatchDF = DF1[DF1["Salary"] != DF1["Salary2"]]
Upvotes: 0