bigdataadd

Reputation: 251

How to perform a window function operation on a pandas DataFrame using a cumulative sum?

I have an initial DataFrame:

df1 = 
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
+---+---+---+

Using PySpark, I added a column with a window function: a sum over column 'C', partitioned by column 'A' and ordered by column 'B'.

from pyspark.sql import Window
from pyspark.sql import functions as F

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D', F.sum('C').over(spec))

df1.show()

+---+---+---+-----+
|  A|  B|  C|    D|
+---+---+---+-----+
|  1|  1| 10| 10.0|
|  1|  2| 11| 33.0|
|  1|  2| 12| 33.0|
|  2|  1| 14| 45.0|
|  2|  1| 15| 45.0|
|  2|  1| 16| 45.0|
|  3|  1| 13| 13.0|
|  4|  1| 17| 17.0|
|  4|  2| 18| 35.0|
|  4|  3| 19| 54.0|
|  4|  4| 19| 73.0|
|  4|  5| 20|113.0|
|  4|  5| 20|113.0|
+---+---+---+-----+

Is it possible to do the same calculation using a pandas DataFrame?

I tried using

df1['D'] = df1.sort_values(['A','B']).groupby(['A', 'B'])['C'].transform('cumsum')

but it does not give the same result.

Thanks

Upvotes: 1

Views: 996

Answers (2)

ALollz

Reputation: 59549

You can sort, then take the cumulative sum of 'C' within 'A', then take the max within ['A', 'B'] groups using transform, so the result keeps the original index and can be assigned back.

df1['D'] = (df1.sort_values(['A', 'B'])
               .groupby('A')['C'].cumsum()
               .groupby([df1['A'], df1['B']]).transform('max'))

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113
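
For clarity, here is a minimal sketch that splits the chained expression into its two stages (the intermediate names are only illustrative, and the DataFrame construction mirrors the Setup in the answer below); the values in the comments are what the question's data produces. The transform('max') step is what makes rows with tied 'B' values share the final running total, mirroring the Spark window output above.

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})

# Stage 1: row-by-row running total of 'C' within each 'A' partition.
# For A == 1 (sorted by 'B') this gives 10, 21, 33.
stage1 = df1.sort_values(['A', 'B']).groupby('A')['C'].cumsum()

# Stage 2: within each (A, B) pair take the max, so tied 'B' values end up
# with the same value (10, 33, 33 for A == 1). transform keeps the original
# index, so the result can be assigned straight back.
df1['D'] = stage1.groupby([df1['A'], df1['B']]).transform('max')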

Upvotes: 1

Henry Ecker

Reputation: 35636

In pandas we can take a groupby sum over ['A', 'B'], then a groupby cumsum over just 'A'. To add the result back to the DataFrame, rename the Series to the new column name and join it back to the DataFrame on the initial group keys ['A', 'B']:

df1 = df1.join(
    df1.groupby(by=['A', 'B'])['C'].sum()
        .groupby(level='A').cumsum()
        .rename('D'),
    on=['A', 'B']
)

df1:

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113

Setup:

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})
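
For reference, a minimal sketch that splits the chained expression above into its intermediate steps, using the setup just shown; the intermediate names are only illustrative, and the values in the comments follow from the question's data.

# Per-(A, B) group sums: for A == 1 this gives 10 for B == 1 and 23 for B == 2.
per_pair = df1.groupby(by=['A', 'B'])['C'].sum()

# Running total of those sums within each 'A' level: 10, 33 for A == 1.
running = per_pair.groupby(level='A').cumsum().rename('D')

# Joining on ['A', 'B'] broadcasts each running total back to every row of
# its (A, B) pair, reproducing the window-function result in column 'D'.
out = df1.join(running, on=['A', 'B'])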

Upvotes: 3
