user2242044

Reputation: 9213

Efficiently comparing data across rows in a Pandas Dataframe

I have a CSV file of monthly cell phone bills, in no particular order, that I read into a Pandas DataFrame. I'd like to add a column showing how much each bill differs from the previous bill for the same account. The CSV below is just a subset of my data. My code works, but it is sloppy and very slow on a CSV file of close to a million rows.

What should I be doing to make this more efficient?

CSV:

Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25

Python:

import numpy as np
import pandas as pd
data = pd.read_csv('data.csv', low_memory=False)

# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)

# add an empty object-dtype column for the difference (it will hold both numbers and "-")
data['Difference'] = None

for index, row in data.iterrows():

    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"

print(data)

Desired Output:

   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100

Upvotes: 1

Views: 1067

Answers (2)

MaxU - stand with Ukraine

Reputation: 210832

Try this:

In [37]: df = df.sort_values(['Account Number','Bill Month'])

In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount']
   ....:                       .diff()
   ....:                       .fillna('-')
   ....:                    )

In [39]: df
Out[39]:
   Account Number Bill Month  Bill Amount Difference
3            2322 2015-01-01           22          -
5            2322 2015-02-01           25          3
4            2322 2015-03-01           38         13
1            4543 2015-01-01          100          -
2            4543 2015-02-01          200        100
0            4543 2015-03-01          300        100

Explanation:

diff() is applied to each group separately. Within each group it returns the difference between the current value and the previous one, so the first row of every group is NaN:

In [123]: df.groupby(['Account Number'])['Bill Amount'].diff()
Out[123]:
3      NaN
5      3.0
4     13.0
1      NaN
2    100.0
0    100.0
dtype: float64

fillna('-') replaces every NaN with the specified value, here '-'. Because the column then mixes strings and numbers, its dtype becomes object:

In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-')
Out[124]:
3      -
5      3
4     13
1      -
2    100
0    100
dtype: object
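For reference, here is a minimal self-contained sketch of the same approach, starting from the CSV text in the question. It assumes the dates are in month/day/year format and parses them with pd.to_datetime before sorting, since sorting the raw strings would put '10/1/2015' before '2/1/2015'. It also leaves the NaN values in place rather than filling them with '-', which keeps the column numeric; that is a design choice of this sketch, not part of the answer above.

```python
import io

import pandas as pd

# Sample data copied from the question; in practice this would be pd.read_csv('data.csv').
csv_text = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: dates are month/day/year. Parse them so the sort is chronological.
df['Bill Month'] = pd.to_datetime(df['Bill Month'], format='%m/%d/%Y')

# Sort so that each account's bills are in date order, then renumber the rows.
df = df.sort_values(['Account Number', 'Bill Month']).reset_index(drop=True)

# Per-account difference: current bill minus the previous bill of the same account.
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

differences = df['Difference'].tolist()
print(df)
```

If the '-' placeholder is only needed for display, it can be applied at the very end with fillna('-'), after any numeric work on the column is done.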

Upvotes: 1

Alexander

Reputation: 109546

import pandas as pd

# sample data from the question, with Bill Month already parsed as timestamps
df = pd.DataFrame({
    'Account Number': {0: 4543, 1: 4543, 2: 4543, 3: 2322, 4: 2322, 5: 2322},
    'Bill Amount': {0: 300.0, 1: 100.0, 2: 200.0, 3: 22.0, 4: 38.0, 5: 25.0},
    'Bill Month': {
        0: pd.Timestamp('2015-03-01 00:00:00'),
        1: pd.Timestamp('2015-01-01 00:00:00'),
        2: pd.Timestamp('2015-02-01 00:00:00'),
        3: pd.Timestamp('2015-01-01 00:00:00'),
        4: pd.Timestamp('2015-03-01 00:00:00'),
        5: pd.Timestamp('2015-02-01 00:00:00')}})

You can group on account number and billing month (which sorts by default), sum the Bill Amount (or just take the first if you are guaranteed to only have one bill per month), group again on the first level of the index (the account number), and take the difference using diff.

>>> (df.groupby(['Account Number', 'Bill Month'])['Bill Amount']
       .sum()
       .groupby(level=0)
       .diff())
Account Number  Bill Month
2322            2015-01-01    NaN
                2015-02-01      3
                2015-03-01     13
4543            2015-01-01    NaN
                2015-02-01    100
                2015-03-01    100

Upvotes: 1
