Efficiently comparing data across rows in a Pandas Dataframe

Question

I have a CSV file of monthly cell phone bills, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column to each bill showing how much it differs from the previous bill for the same account. The CSV below is just a subset of my data. My code works, but it is sloppy and very slow on a CSV file of close to a million rows.

What should I be doing to make this more efficient?

CSV:

Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25

Python:

import numpy as np
import pandas as pd
data = pd.read_csv('data.csv', low_memory=False)

# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)

# add a blank object-dtype column so it can hold both numbers and the "-" placeholder
data['Difference'] = np.full(len(data), np.nan, dtype=object)

for index, row in data.iterrows():

    # special handling for the first row so I don't look up index -1
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"

print(data)

Desired Output:

   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100

MaxU - stand with Ukraine · Accepted Answer

Try this:

In [37]: df = df.sort_values(['Account Number','Bill Month'])

In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount']
   ....:                       .diff()
   ....:                       .fillna('-')
   ....:                    )

In [39]: df
Out[39]:
   Account Number Bill Month  Bill Amount Difference
3            2322 2015-01-01           22          -
5            2322 2015-02-01           25          3
4            2322 2015-03-01           38         13
1            4543 2015-01-01          100          -
2            4543 2015-02-01          200        100
0            4543 2015-03-01          300        100
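
The output above shows Bill Month as dates (2015-01-01), which suggests the column was parsed as datetimes before sorting. The question's read_csv call leaves it as text, and sorting text would put 10/1/2015 before 2/1/2015. A minimal end-to-end sketch, assuming the sample CSV from the question (inlined through io.StringIO so it runs on its own) and M/D/YYYY dates:

```python
import io

import pandas as pd

# Sample rows from the question, inlined so the example is self-contained.
csv_text = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

df = pd.read_csv(io.StringIO(csv_text))

# Parse the dates explicitly (assumed format: month/day/year) so the
# sort below is chronological rather than alphabetical.
df['Bill Month'] = pd.to_datetime(df['Bill Month'], format='%m/%d/%Y')

# Sort by account, then by date, and take the per-account difference.
df = df.sort_values(['Account Number', 'Bill Month'])
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

print(df)
```

The first bill of each account has no previous bill, so its Difference is NaN at this point.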

Explanation:

diff() is applied to each group separately. Within a group it returns the difference between the current value and the previous value, so the first row of each group has nothing to compare against and gets NaN:

In [123]: df.groupby(['Account Number'])['Bill Amount'].diff()
Out[123]:
3      NaN
5      3.0
4     13.0
1      NaN
2    100.0
0    100.0
dtype: float64
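
One way to see what the grouped diff() does is to compare it with subtracting the previous row of the same group by hand, using shift(). This is only an illustrative sketch on the already-sorted sample values; the variable names are made up for the example:

```python
import pandas as pd

# Sample data from the question, already sorted by account and month.
df = pd.DataFrame({
    'Account Number': [2322, 2322, 2322, 4543, 4543, 4543],
    'Bill Amount': [22, 25, 38, 100, 200, 300],
})

amounts = df.groupby('Account Number')['Bill Amount']

# Grouped diff: current value minus the previous value in the same group.
via_diff = amounts.diff()

# The same thing spelled out: shift each group down by one row and subtract.
via_shift = df['Bill Amount'] - amounts.shift(1)

# equals() treats NaN in matching positions as equal.
print(via_diff.equals(via_shift))
```

Because shift() never crosses a group boundary, the first bill of account 4543 is not compared with the last bill of account 2322.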

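fillna('-') replaces every NaN with the specified value, here '-'. Because the column now mixes numbers and strings, its dtype changes from float64 to object: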

In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-')
Out[124]:
3      -
5      3
4     13
1      -
2    100
0    100
dtype: object
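
If the differences are needed for further calculations, it may be worth keeping the numeric column with NaN and building the '-' version only for display. A small sketch of that idea, using the values from the output above (the names diff, total_change and display are hypothetical):

```python
import pandas as pd

# Per-account differences as produced by groupby(...).diff().
diff = pd.Series([float('nan'), 3.0, 13.0, float('nan'), 100.0, 100.0])

# The numeric column still supports arithmetic; sum() skips NaN by default.
total_change = diff.sum()

# Display-only copy: '-' for missing values, whole numbers otherwise.
display = diff.map(lambda v: '-' if pd.isna(v) else int(v))

print(total_change)
print(display.tolist())
```

The display copy has object dtype, so it is best created as the last step before printing or exporting.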
