bigdataadd

Reputation: 251

How to perform a window function operation on a pandas DataFrame using a cumulative sum?

I have an initial DataFrame:

df1 = 
+---+---+---+
|  A|  B|  C|
+---+---+---+
|  1|  1| 10|
|  1|  2| 11|
|  1|  2| 12|
|  3|  1| 13|
|  2|  1| 14|
|  2|  1| 15|
|  2|  1| 16|
|  4|  1| 17|
|  4|  2| 18|
|  4|  3| 19|
|  4|  4| 19|
|  4|  5| 20|
|  4|  5| 20|
+---+---+---+

Using PySpark, I added a column with a window function: a sum over column 'C', partitioned by column 'A' and ordered by column 'B'.

from pyspark.sql import Window
from pyspark.sql import functions as F

spec = Window.partitionBy('A').orderBy('B')
df1 = df1.withColumn('D', F.sum('C').over(spec))

df1.show()

+---+---+---+-----+
|  A|  B|  C|    D|
+---+---+---+-----+
|  1|  1| 10| 10.0|
|  1|  2| 11| 33.0|
|  1|  2| 12| 33.0|
|  2|  1| 14| 45.0|
|  2|  1| 15| 45.0|
|  2|  1| 16| 45.0|
|  3|  1| 13| 13.0|
|  4|  1| 17| 17.0|
|  4|  2| 18| 35.0|
|  4|  3| 19| 54.0|
|  4|  4| 19| 73.0|
|  4|  5| 20|113.0|
|  4|  5| 20|113.0|
+---+---+---+-----+

Is it possible to do the same calculation using a pandas DataFrame?

I tried using

df1['D'] = df1.sort_values(['A','B']).groupby(['A', 'B'])['C'].transform('cumsum')

but it does not give the same result.

Thanks

Upvotes: 1

Views: 996

Answers (2)

ALollz

Reputation: 59549

You can sort, then take the cumulative sum of 'C' within 'A', then take the max within ['A', 'B'] groups using transform, so the result keeps the original index and can be assigned back.

df1['D'] = (df1.sort_values(['A', 'B'])
               .groupby('A')['C'].cumsum()
               .groupby([df1['A'], df1['B']]).transform('max'))

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113
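
For clarity, here is a minimal sketch that splits the chained expression into its two stages (the intermediate names are only illustrative, and the DataFrame construction mirrors the Setup in the answer below); the values in the comments are what the question's data produces. The transform('max') step is what makes rows with tied 'B' values share the final running total, mirroring the Spark window output above.

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})

# Stage 1: row-by-row running total of 'C' within each 'A' partition.
# For A == 1 (sorted by 'B') this gives 10, 21, 33.
stage1 = df1.sort_values(['A', 'B']).groupby('A')['C'].cumsum()

# Stage 2: within each (A, B) pair take the max, so tied 'B' values end up
# with the same value (10, 33, 33 for A == 1). transform keeps the original
# index, so the result can be assigned straight back.
df1['D'] = stage1.groupby([df1['A'], df1['B']]).transform('max')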

Upvotes: 1

Henry Ecker

Reputation: 35636

In pandas we can take a groupby sum over ['A', 'B'], then a groupby cumsum over just 'A'. To add the result back to the DataFrame, rename the Series to the new column name and join it back to the DataFrame on the initial group keys ['A', 'B']:

df1 = df1.join(
    df1.groupby(by=['A', 'B'])['C'].sum()
        .groupby(level='A').cumsum()
        .rename('D'),
    on=['A', 'B']
)

df1:

    A  B   C    D
0   1  1  10   10
1   1  2  11   33
2   1  2  12   33
3   3  1  13   13
4   2  1  14   45
5   2  1  15   45
6   2  1  16   45
7   4  1  17   17
8   4  2  18   35
9   4  3  19   54
10  4  4  19   73
11  4  5  20  113
12  4  5  20  113

Setup:

import pandas as pd

df1 = pd.DataFrame({
    'A': [1, 1, 1, 3, 2, 2, 2, 4, 4, 4, 4, 4, 4],
    'B': [1, 2, 2, 1, 1, 1, 1, 1, 2, 3, 4, 5, 5],
    'C': [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 19, 20, 20]
})
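
For reference, a minimal sketch that splits the chained expression above into its intermediate steps, using the setup just shown; the intermediate names are only illustrative, and the values in the comments follow from the question's data.

# Per-(A, B) group sums: for A == 1 this gives 10 for B == 1 and 23 for B == 2.
per_pair = df1.groupby(by=['A', 'B'])['C'].sum()

# Running total of those sums within each 'A' level: 10, 33 for A == 1.
running = per_pair.groupby(level='A').cumsum().rename('D')

# Joining on ['A', 'B'] broadcasts each running total back to every row of
# its (A, B) pair, reproducing the window-function result in column 'D'.
out = df1.join(running, on=['A', 'B'])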

Upvotes: 3
