Cbaderivado
Cbaderivado

Reputation: 21

Cumulative sum with two conditions?

I´m just starting to use python and pandas to improve my workload. I have df as follows:

df = pd.DataFrame({
'Div': [2,2,2,2,3,3,3,3], 
'date': ['01/09/2020', '01/09/2020', '01/09/2020','02/09/2020',
'01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'],
'income': [1000,1500,1000,500,700,2000,1000,6000],
'total':[0,0,0,0,0,0,0,0]
})

I need to calculate the cumulative total for each row while the div column doesn´t change. So far I´ve managed to do it this way:

df2=df
for i in df.index:
    for j in df2.index:
        if (df.loc[i,'Div']==df2.loc[j,'Div'] and df2.loc[j,'date']<=df.loc[i,'date']):
            df.loc[i,'total']+=df2.loc[j,'income']

The result is this:

Div date income total
2 01/09/2020 1000 3500
2 01/09/2020 1500 3500
2 01/09/2020 1000 3500
2 02/09/2020 500 4000
3 01/09/2020 700 3700
3 01/09/2020 2000 3700
3 01/09/2020 1000 3700
3 02/09/2020 6000 9700

It works, but my original file has 13000 rows and it takes more than 2 hours to finish. I´ve been reading around the web and everywhere it´s said that iteration should be avoided while using pandas but I can´t find a solution that fits my problem.

Is there a better way?

Upvotes: 2

Views: 208

Answers (5)

Mayank Porwal
Mayank Porwal

Reputation: 34046

One nice way of doing this would be using Groupby.transform:

In [2684]: df['total'] = df.groupby(['Div', 'date'])['income'].transform('cumsum')
In [2686]: df['total'] = df.groupby(['Div', 'date'])['total'].transform('max')

In [2687]: df
Out[2687]: 
   Div        date  income  total
0    2  01/09/2020    1000   3500
1    2  01/09/2020    1500   3500
2    2  01/09/2020    1000   3500
3    2  02/09/2020     500    500
4    3  01/09/2020     700   3700
5    3  01/09/2020    2000   3700
6    3  01/09/2020    1000   3700
7    3  02/09/2020    6000   6000

Upvotes: 1

mprivat
mprivat

Reputation: 21902

I'd do this:

precalc = df.groupby(["Div", "date"]).sum().reset_index().set_index("date")
df["total"] = df.apply(lambda row: precalc[precalc["Div"]==row["Div"]][:row["date"]]["income"].sum(), axis=1)

precalc basically sums for all the pairs ("Div", "date"). Then run apply on each row of the table to create the total which is the sum of all precalc rows for the matching Div and with dates ranging from the beginning to the row's date.

My theory here is by doing the pre-calculation ahead of time, you avoid having to do this for no reason while creating the cumulative sum. So essentially assuming there are lots of rows for a given Div and date.

Upvotes: 0

Lambda
Lambda

Reputation: 1392

group by Div and date to use sum and then group by Div to use cumsum, then assign the column total with the calculate result.

result = (
df.set_index(['Div', 'date'])
  .assign(
      total=df.groupby(['Div', 'date'])['income'].sum().groupby("Div").cumsum()
  ).reset_index()
)
result

output:

    Div date        income  total
0   2   01/09/2020  1000    3500
1   2   01/09/2020  1500    3500
2   2   01/09/2020  1000    3500
3   2   02/09/2020  500     4000
4   3   01/09/2020  700     3700
5   3   01/09/2020  2000    3700
6   3   01/09/2020  1000    3700
7   3   02/09/2020  6000    9700

Upvotes: 0

Ferris
Ferris

Reputation: 5601

# step 1 calculate the sum income by group 'Div' and 'date'
df1 = df.groupby(['Div','date'])['income'].sum().reset_index()
#   Div date    income
# 0 2   01/09/2020  3500
# 1 2   02/09/2020  500
# 2 3   01/09/2020  3700
# 3 3   02/09/2020  6000


# step 2 calculate the cumsum income by 'Div'
df1['total'] = df1.groupby('Div')['income'].cumsum()
# Div   date    income  total
# 0 2   01/09/2020  3500    3500
# 1 2   02/09/2020  500 4000
# 2 3   01/09/2020  3700    3700
# 3 3   02/09/2020  6000    9700



# step3 merge the origin df with total column
del df['total']
pd.merge(df, df1[['Div', 'date', 'total']], on=['Div', 'date'], how='left')

#   Div date    income  total
# 0 2   01/09/2020  1000    3500
# 1 2   01/09/2020  1500    3500
# 2 2   01/09/2020  1000    3500
# 3 2   02/09/2020  500 4000
# 4 3   01/09/2020  700 3700
# 5 3   01/09/2020  2000    3700
# 6 3   01/09/2020  1000    3700
# 7 3   02/09/2020  6000    9700

Upvotes: 0

cvanelteren
cvanelteren

Reputation: 1703

You want to

  • compute total as cumulative sum fixing div? Use: groupby

enter image description here

df = pd.DataFrame({
'Div': [2,2,2,2,3,3,3,3], 
'date': ['01/09/2020', '01/09/2020', '01/09/2020','02/09/2020',
'01/09/2020', '01/09/2020', '01/09/2020', '02/09/2020'],
'income': [1000,1500,1000,500,700,2000,1000,6000],
'total':[0,0,0,0,0,0,0,0]
})

df['total'] = df.groupby("Div date".split()).cumsum()
display(df)

Upvotes: 0

Related Questions