Reputation: 63
I am relatively new to python, but have been using Wes Kinney's, "Python for Data Analysis" religiously for the last few weeks. I have spent hours trying to figure out a solution to my current issue, but I need some help. I have a data set containing details for shipments that were sent in this calendar year, however; as I get new data each month, some details may have changed. I have figured out how to identify the shipments that have changed, as well as how to identify what those changes may be.
So, let's say I've identified that these shipments (in dataframe, original) have changed:
ID Code Mode Amount From To Weight Cube Service_Date
MNO123 BBB Air 50 M1234 M9876 60 6 1-1-2013
GHI123 AAA Air 50 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 50 M1234 M9876 70 7 1-1-2013
And I've identified the potential changes (in a dataframe, changes) as:
ID Code Mode Amount From To Weight Cube Service_Date
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013
All I'm trying to do is add a count column to the changes dataframe which totals the amount of values that match the respective values in the original dataframe. So, because code, mode, amount, from, to, weight, and cube match, the count column would get the value of 7 for the first observation. Similarly, but having one less matching value, the second observation would have a count value of 6 and the third would have a count value of 5.
The result I am looking for is as follows:
ID Code Mode Amount From To Weight Cube Service_Date Count
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013 7
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013 6
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013 5
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013 7
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013 7
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013 6
Having gone through Wes' book and numerous somewhat similar posts on this site, I believe I need to be using df.iterrows()
, but I am struggling with the process to iterate over both dataframes while also checking and counting for matching values.
This was my latest attempt:
for i in changes.iterrows():
for i in original.iterrows():
changes['count'] = 0
if changes(i) == original(i):
changes['count'] +=1
Thanks in advance for your time and efforts!
Upvotes: 3
Views: 1361
Reputation: 375425
Here's one way:
Make sure you set the index of both the original and changes DataFrames to be ID:
In [11]: orignal.set_index('ID', inplace=True)
In [12]: original
Out[12]:
Code Mode Amount From To Weight Cube Service_Date
ID
MNO123 BBB Air 50 M1234 M9876 60 6 1-1-2013
GHI123 AAA Air 50 M1234 M9876 80 8 1-1-2013
JKL123 AAA Ship 50 M1234 M9876 70 7 1-1-2013
You'll also have to do a little hack here to allow us to use the eq DataFrame method, unfortunately this does a sort (or you could keep track of the original unique index).
In [13]: changes = changes.set_index('ID').sort_index()
Select the columns you are interested in (alternatively you could just drop the Service_Date column):
In [14]: count_columns = ['Code', 'Mode', 'Amount', 'From', 'To', 'Weight', 'Cube']
Then you can use the eq DataFrame method:
In [15]: changes.eq(original)[count_columns]
Out[15]:
Code Mode Amount From To Weight Cube
ID
GHI123 True True False True True True True
JKL123 True True False True True True True
JKL123 True True False True True True False
MNO123 True True True True True True True
MNO123 True True False True True True True
MNO123 True True False True False True True
and sum each row:
In [16]: changes.eq(original)[count_columns].sum(1)
Out[16]:
ID
GHI123 6
JKL123 6
JKL123 5
MNO123 7
MNO123 6
MNO123 5
dtype: int64
In [17]: changes['match'] = changes.eq(original)[count_columns].sum(1).values
In [18]: changes
Out[18]:
Code Mode Amount From To Weight Cube Service_Date match
ID
GHI123 AAA Air 65 M1234 M9876 80 8 1-1-2013 6
JKL123 AAA Ship 65 M1234 M9876 70 7 1-1-2013 6
JKL123 AAA Ship 65 M1234 M9876 70 8 1-1-2013 5
MNO123 BBB Air 50 M1234 M9876 60 6 2-2-2013 7
MNO123 BBB Air 60 M1234 M9876 60 6 2-2-2013 6
MNO123 BBB Air 70 M1234 M1111 60 6 2-2-2013 5
Note: counts appear slightly different to yours...
Upvotes: 1
Reputation: 40963
You don't need to iterate the rows:
def count_equal(row, original, ID):
"""Counts the number of equal elements between row and original.ix[ID]"""
equal_values = (row == original[original.ID == ID]).values
return equal_values.sum() - 1 # substract 1 because ID doesn't count
changes['count'] = changes.apply(count_equal, args=(original, 'MNO123'), axis=1)
Upvotes: 0