Counting similarities among two pandas dataframes

Question

I am relatively new to python, but have been using Wes Kinney's, "Python for Data Analysis" religiously for the last few weeks. I have spent hours trying to figure out a solution to my current issue, but I need some help. I have a data set containing details for shipments that were sent in this calendar year, however; as I get new data each month, some details may have changed. I have figured out how to identify the shipments that have changed, as well as how to identify what those changes may be.

So, let's say I've identified that these shipments (in dataframe, original) have changed:

ID      Code    Mode    Amount    From     To     Weight    Cube    Service_Date
MNO123  BBB     Air     50        M1234    M9876  60        6       1-1-2013
GHI123  AAA     Air     50        M1234    M9876  80        8       1-1-2013
JKL123  AAA     Ship    50        M1234    M9876  70        7       1-1-2013

And I've identified the potential changes (in a dataframe, changes) as:

ID      Code    Mode    Amount    From     To     Weight    Cube    Service_Date
MNO123  BBB     Air     50        M1234    M9876  60        6       2-2-2013
MNO123  BBB     Air     60        M1234    M9876  60        6       2-2-2013
MNO123  BBB     Air     70        M1234    M1111  60        6       2-2-2013
GHI123  AAA     Air     65        M1234    M9876  80        8       1-1-2013
JKL123  AAA     Ship    65        M1234    M9876  70        7       1-1-2013
JKL123  AAA     Ship    65        M1234    M9876  70        8       1-1-2013

All I'm trying to do is add a count column to the changes dataframe which totals the amount of values that match the respective values in the original dataframe. So, because code, mode, amount, from, to, weight, and cube match, the count column would get the value of 7 for the first observation. Similarly, but having one less matching value, the second observation would have a count value of 6 and the third would have a count value of 5.

The result I am looking for is as follows:

ID      Code    Mode    Amount    From     To     Weight    Cube    Service_Date  Count
MNO123  BBB     Air     50        M1234    M9876  60        6       2-2-2013      7
MNO123  BBB     Air     60        M1234    M9876  60        6       2-2-2013      6
MNO123  BBB     Air     70        M1234    M1111  60        6       2-2-2013      5
GHI123  AAA     Air     65        M1234    M9876  80        8       1-1-2013      7
JKL123  AAA     Ship    65        M1234    M9876  70        7       1-1-2013      7
JKL123  AAA     Ship    65        M1234    M9876  70        8       1-1-2013      6

Having gone through Wes' book and numerous somewhat similar posts on this site, I believe I need to be using df.iterrows(), but I am struggling with the process to iterate over both dataframes while also checking and counting for matching values.

This was my latest attempt:

for i in changes.iterrows():
  for i in original.iterrows():
    changes['count'] = 0
    if changes(i) == original(i):
      changes['count'] +=1

Thanks in advance for your time and efforts!

Andy Hayden · Accepted Answer

Here's one way:

Make sure you set the index of both the original and changes DataFrames to be ID:

In [11]: orignal.set_index('ID', inplace=True)

In [12]: original
Out[12]: 
       Code  Mode  Amount   From     To  Weight  Cube Service_Date
ID                                                                
MNO123  BBB   Air      50  M1234  M9876      60     6     1-1-2013
GHI123  AAA   Air      50  M1234  M9876      80     8     1-1-2013
JKL123  AAA  Ship      50  M1234  M9876      70     7     1-1-2013

You'll also have to do a little hack here to allow us to use the eq DataFrame method, unfortunately this does a sort (or you could keep track of the original unique index).

In [13]: changes = changes.set_index('ID').sort_index()

Select the columns you are interested in (alternatively you could just drop the Service_Date column):

In [14]: count_columns = ['Code', 'Mode', 'Amount', 'From', 'To', 'Weight', 'Cube']

Then you can use the eq DataFrame method:

In [15]: changes.eq(original)[count_columns]
Out[15]: 
        Code  Mode Amount  From     To Weight   Cube
ID                                                  
GHI123  True  True  False  True   True   True   True
JKL123  True  True  False  True   True   True   True
JKL123  True  True  False  True   True   True  False
MNO123  True  True   True  True   True   True   True
MNO123  True  True  False  True   True   True   True
MNO123  True  True  False  True  False   True   True

and sum each row:

In [16]: changes.eq(original)[count_columns].sum(1)
Out[16]: 
ID
GHI123    6
JKL123    6
JKL123    5
MNO123    7
MNO123    6
MNO123    5
dtype: int64

In [17]: changes['match'] = changes.eq(original)[count_columns].sum(1).values

In [18]: changes
Out[18]: 
       Code  Mode  Amount   From     To  Weight  Cube Service_Date  match
ID                                                                       
GHI123  AAA   Air      65  M1234  M9876      80     8     1-1-2013      6
JKL123  AAA  Ship      65  M1234  M9876      70     7     1-1-2013      6
JKL123  AAA  Ship      65  M1234  M9876      70     8     1-1-2013      5
MNO123  BBB   Air      50  M1234  M9876      60     6     2-2-2013      7
MNO123  BBB   Air      60  M1234  M9876      60     6     2-2-2013      6
MNO123  BBB   Air      70  M1234  M1111      60     6     2-2-2013      5

Note: counts appear slightly different to yours...

Counting similarities among two pandas dataframes

Answers (2)

Related Questions