Reputation: 9213
I have a CSV file of monthly cell phone bills, in no particular order, that I read into a pandas DataFrame. I'd like to add a column for each bill that shows how much it differed from the previous bill for the same account. This CSV is just a subset of my data. My code works, but it is sloppy and very slow on a CSV file of close to a million rows.
What should I be doing to make this more efficient?
CSV:
Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
Python:
import numpy as np
import pandas as pd

data = pd.read_csv('data.csv', low_memory=False)
# sort my data and reset the index so I can use index and index - 1 in the loop
data = data.sort_values(by=['Account Number', 'Bill Month'])
data = data.reset_index(drop=True)
# add a blank column for the difference
data['Difference'] = np.nan
for index, row in data.iterrows():
    # special handling for the first row so I don't get negative indexes
    if index == 0:
        data.loc[index, 'Difference'] = "-"
    else:
        # if the account in the current row and the row before are the same, then compare Bill Amounts
        if data.loc[index, 'Account Number'] == data.loc[index - 1, 'Account Number']:
            data.loc[index, 'Difference'] = data.loc[index, 'Bill Amount'] - data.loc[index - 1, 'Bill Amount']
        else:
            data.loc[index, 'Difference'] = "-"
print(data)
Desired Output:
   Account Number Bill Month  Bill Amount Difference
0            2322   1/1/2015           22          -
1            2322   2/1/2015           25          3
2            2322   3/1/2015           38         13
3            4543   1/1/2015          100          -
4            4543   2/1/2015          200        100
5            4543   3/1/2015          300        100
Upvotes: 1
Views: 1067
Reputation: 210832
Try this:
In [37]: df = df.sort_values(['Account Number','Bill Month'])
In [38]: df['Difference'] = (df.groupby(['Account Number'])['Bill Amount']
....: .diff()
....: .fillna('-')
....: )
In [39]: df
Out[39]:
   Account Number Bill Month  Bill Amount Difference
3            2322 2015-01-01           22          -
5            2322 2015-02-01           25          3
4            2322 2015-03-01           38         13
1            4543 2015-01-01          100          -
2            4543 2015-02-01          200        100
0            4543 2015-03-01          300        100
Explanation:
diff() is applied to each group separately. Within a group it returns the difference between the current value and the previous one, so the first row of each group is NaN:
In [123]: df.groupby(['Account Number'])['Bill Amount'].diff()
Out[123]:
3 NaN
5 3.0
4 13.0
1 NaN
2 100.0
0 100.0
dtype: float64
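To make the direction of the subtraction concrete, here is a small sketch on a toy Series (the values are made up for illustration, not taken from a real file). It suggests that diff() is equivalent to subtracting a copy of the data shifted down by one row:

```python
import pandas as pd

# Toy bill amounts for a single account, already in month order (illustrative values)
amounts = pd.Series([100, 200, 300])

by_diff = amounts.diff()               # current value minus previous value
by_shift = amounts - amounts.shift(1)  # the same subtraction spelled out

print(by_diff.tolist())  # [nan, 100.0, 100.0]
```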
fillna('-') fills all NaNs with the specified value, '-'. Because the column then mixes strings and numbers, its dtype becomes object:
In [124]: df.groupby(['Account Number'])['Bill Amount'].diff().fillna('-')
Out[124]:
3 -
5 3
4 13
1 -
2 100
0 100
dtype: object
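For reference, the whole approach can be sketched end to end on the sample CSV from the question. This is a minimal sketch under a few assumptions: the CSV text is inlined with io.StringIO instead of being read from data.csv, the dates are assumed to be month/day/year, and the NaN values are left in place rather than replaced with '-'.

```python
import io
import pandas as pd

# The sample CSV from the question, inlined so the example is self-contained
csv_text = """Account Number,Bill Month,Bill Amount
4543,3/1/2015,300
4543,1/1/2015,100
4543,2/1/2015,200
2322,1/1/2015,22
2322,3/1/2015,38
2322,2/1/2015,25
"""

df = pd.read_csv(io.StringIO(csv_text))

# Assumption: dates are month/day/year. Parsing them makes the sort chronological
# instead of alphabetical (as strings, '10/1/2015' would sort before '2/1/2015').
df['Bill Month'] = pd.to_datetime(df['Bill Month'], format='%m/%d/%Y')

df = df.sort_values(['Account Number', 'Bill Month']).reset_index(drop=True)

# Per-account difference from the previous bill; the first bill of each account stays NaN
df['Difference'] = df.groupby('Account Number')['Bill Amount'].diff()

print(df)
```

Leaving the first bill of each account as NaN keeps the column numeric, so it can still be summed or averaged. If the dash is wanted, fillna('-') can be applied at display time only.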
Upvotes: 1
Reputation: 109546
import pandas as pd

df = pd.DataFrame({
    'Account Number': {0: 4543, 1: 4543, 2: 4543, 3: 2322, 4: 2322, 5: 2322},
    'Bill Amount': {0: 300.0, 1: 100.0, 2: 200.0, 3: 22.0, 4: 38.0, 5: 25.0},
    'Bill Month': {
        0: pd.Timestamp('2015-03-01 00:00:00'),
        1: pd.Timestamp('2015-01-01 00:00:00'),
        2: pd.Timestamp('2015-02-01 00:00:00'),
        3: pd.Timestamp('2015-01-01 00:00:00'),
        4: pd.Timestamp('2015-03-01 00:00:00'),
        5: pd.Timestamp('2015-02-01 00:00:00')}})
You can group on account number and bill month (groupby sorts the group keys by default), sum the Bill Amount (or just take the first value if you are guaranteed to have only one bill per month), group again on the first level of the index (the account number), and take the difference using diff.
>>> (df.groupby(['Account Number', 'Bill Month'])['Bill Amount']
...    .sum()
...    .groupby(level=0)
...    .diff())
Account Number  Bill Month
2322            2015-01-01    NaN
                2015-02-01      3
                2015-03-01     13
4543            2015-01-01    NaN
                2015-02-01    100
                2015-03-01    100
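The result above is a Series indexed by account number and bill month rather than a flat table. One possible way to get back to a table with a Difference column (the column name is borrowed from the question) is sketched below; the variable names monthly, diff and result are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Account Number': [4543, 4543, 4543, 2322, 2322, 2322],
    'Bill Month': pd.to_datetime(['2015-03-01', '2015-01-01', '2015-02-01',
                                  '2015-01-01', '2015-03-01', '2015-02-01']),
    'Bill Amount': [300.0, 100.0, 200.0, 22.0, 38.0, 25.0],
})

# Total per account and month; groupby sorts the group keys by default
monthly = df.groupby(['Account Number', 'Bill Month'])['Bill Amount'].sum()

# Difference from the previous month within each account (index level 0)
diff = monthly.groupby(level=0).diff().rename('Difference')

# Put totals and differences side by side, then turn the index levels into columns
result = pd.concat([monthly, diff], axis=1).reset_index()
print(result)
```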
Upvotes: 1