Reputation: 21
I´m just starting to use python and pandas to improve my workload. I have df as follows:
df = pd.DataFrame({
'Div': [2,2,2,2,3,3,3,3],
'date': ['01/09/2020', '01/09/2020', '01/09/2020','02/09/2020',
'01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'],
'income': [1000,1500,1000,500,700,2000,1000,6000],
'total':[0,0,0,0,0,0,0,0]
})
I need to calculate the cumulative total for each row while the div column doesn´t change. So far I´ve managed to do it this way:
df2=df
for i in df.index:
for j in df2.index:
if (df.loc[i,'Div']==df2.loc[j,'Div'] and df2.loc[j,'date']<=df.loc[i,'date']):
df.loc[i,'total']+=df2.loc[j,'income']
The result is this:
Div | date | income | total |
---|---|---|---|
2 | 01/09/2020 | 1000 | 3500 |
2 | 01/09/2020 | 1500 | 3500 |
2 | 01/09/2020 | 1000 | 3500 |
2 | 02/09/2020 | 500 | 4000 |
3 | 01/09/2020 | 700 | 3700 |
3 | 01/09/2020 | 2000 | 3700 |
3 | 01/09/2020 | 1000 | 3700 |
3 | 02/09/2020 | 6000 | 9700 |
It works, but my original file has 13000 rows and it takes more than 2 hours to finish. I´ve been reading around the web and everywhere it´s said that iteration should be avoided while using pandas but I can´t find a solution that fits my problem.
Is there a better way?
Upvotes: 2
Views: 208
Reputation: 34046
One nice way of doing this would be using Groupby.transform
:
In [2684]: df['total'] = df.groupby(['Div', 'date'])['income'].transform('cumsum')
In [2686]: df['total'] = df.groupby(['Div', 'date'])['total'].transform('max')
In [2687]: df
Out[2687]:
Div date income total
0 2 01/09/2020 1000 3500
1 2 01/09/2020 1500 3500
2 2 01/09/2020 1000 3500
3 2 02/09/2020 500 500
4 3 01/09/2020 700 3700
5 3 01/09/2020 2000 3700
6 3 01/09/2020 1000 3700
7 3 02/09/2020 6000 6000
Upvotes: 1
Reputation: 21902
I'd do this:
precalc = df.groupby(["Div", "date"]).sum().reset_index().set_index("date")
df["total"] = df.apply(lambda row: precalc[precalc["Div"]==row["Div"]][:row["date"]]["income"].sum(), axis=1)
precalc basically sums for all the pairs ("Div", "date"). Then run apply
on each row of the table to create the total
which is the sum of all precalc rows for the matching Div and with dates ranging from the beginning to the row's date.
My theory here is by doing the pre-calculation ahead of time, you avoid having to do this for no reason while creating the cumulative sum. So essentially assuming there are lots of rows for a given Div and date.
Upvotes: 0
Reputation: 1392
group by Div
and date
to use sum
and then group by Div
to use cumsum
, then assign the column total
with the calculate result.
result = (
df.set_index(['Div', 'date'])
.assign(
total=df.groupby(['Div', 'date'])['income'].sum().groupby("Div").cumsum()
).reset_index()
)
result
output:
Div date income total
0 2 01/09/2020 1000 3500
1 2 01/09/2020 1500 3500
2 2 01/09/2020 1000 3500
3 2 02/09/2020 500 4000
4 3 01/09/2020 700 3700
5 3 01/09/2020 2000 3700
6 3 01/09/2020 1000 3700
7 3 02/09/2020 6000 9700
Upvotes: 0
Reputation: 5601
# step 1 calculate the sum income by group 'Div' and 'date'
df1 = df.groupby(['Div','date'])['income'].sum().reset_index()
# Div date income
# 0 2 01/09/2020 3500
# 1 2 02/09/2020 500
# 2 3 01/09/2020 3700
# 3 3 02/09/2020 6000
# step 2 calculate the cumsum income by 'Div'
df1['total'] = df1.groupby('Div')['income'].cumsum()
# Div date income total
# 0 2 01/09/2020 3500 3500
# 1 2 02/09/2020 500 4000
# 2 3 01/09/2020 3700 3700
# 3 3 02/09/2020 6000 9700
# step3 merge the origin df with total column
del df['total']
pd.merge(df, df1[['Div', 'date', 'total']], on=['Div', 'date'], how='left')
# Div date income total
# 0 2 01/09/2020 1000 3500
# 1 2 01/09/2020 1500 3500
# 2 2 01/09/2020 1000 3500
# 3 2 02/09/2020 500 4000
# 4 3 01/09/2020 700 3700
# 5 3 01/09/2020 2000 3700
# 6 3 01/09/2020 1000 3700
# 7 3 02/09/2020 6000 9700
Upvotes: 0
Reputation: 1703
You want to
groupby
df = pd.DataFrame({
'Div': [2,2,2,2,3,3,3,3],
'date': ['01/09/2020', '01/09/2020', '01/09/2020','02/09/2020',
'01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'],
'income': [1000,1500,1000,500,700,2000,1000,6000],
'total':[0,0,0,0,0,0,0,0]
})
df['total'] = df.groupby("Div date".split()).cumsum()
display(df)
Upvotes: 0