wsrt
wsrt

Reputation: 211

How to compare a dataframe with another df and return new rows from the first df using pandas

I am trying to ascertain if values in a test dataframe (df2) are not appearing in another DF (df1). The following are the two DFs:

df1 created from the following source:

field1 field2
AG Agree
SA Somewhat Agree
DG Disagree
SD Somewhat Disagree
NO None

df2 created from the following source:

field1 field2
CA California
TX Texas
NO None
NY New York

Using Method 1 (see below), I am getting the expected result, which is:

Method 1

diff_df = df2[~(df2[field1].isin(df1[field1]) & df2[field2].isin(df1[field2]))].reset_index(drop=True)

This gives me the folllowing expected result:

  field1               field2
0     CA           California
1     TX                Texas
2     NY             New York

Note: The duplicate value in df2 (NO: None) gets dropped, too.

However, there is one problem that I am facing: There can be situations when there are different set of fields that will need to be compared (eg. there may be a third field field3 in the equation).

From case to case basis, the number of fields would vary greatly over which the user won't have control.

My problem: How do I modify my query so that by comparing the two dataframes I get the expected result?

In the situation as explained, what shuld be the possible approach?

Upvotes: 1

Views: 43

Answers (1)

CN Entertainment
CN Entertainment

Reputation: 46

Try:

import pandas as pd

# Creating the dataframe
df1_field1 = ['a', 'b', 'c', 'd'] 
df1_field2 = ['a', 'b', 'c', 'd']
df1_field3 = ['a', 'b', 'c', 'd']
df1 = pd.DataFrame(
    {'field1':df1_field1,
     'field2':df1_field2,
     'field3':df1_field3,     
    })

df2_field1 = ['a', 'c', 'd', 'e'] 
df2_field2 = ['a', 'c', 'd', 'e']
df2_field3 = ['a', 'c', 'd', 'e']
df2 = pd.DataFrame(
    {'field1':df2_field1,
     'field2':df2_field2,
     'field3':df2_field3,     
    })

print(df)
print(df2)

df_all = df2.merge(df1.drop_duplicates(), on=['field1','field2','field3'], 
                   how='outer', indicator=True)
df_all[df_all['_merge'] == 'left_only']

it yields:

  field1 field2 field3
0      a      a      a
1      b      b      b
2      c      c      c
3      d      d      d
  field1 field2 field3
0      a      a      a
1      c      c      c
2      d      d      d
3      e      e      e

field1  field2  field3  _merge
3   e   e   e   left_only

As you can see, it works, and it is just an adaptation of another answer in the page you put in the description.

Upvotes: 2

Related Questions