ignoring_gravity
ignoring_gravity

Reputation: 10563

cumulative sum per group in PyArrow

In pandas I can do:

import pandas as pd

df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})

df.groupby('b')['a'].cumsum()
0     1
1     3
2     6
3     4
4     9
5    15
Name: a, dtype: int64

How can I get the same result in PyArrow starting from

pa.table({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})

without converting to pandas?

Upvotes: 0

Views: 52

Answers (2)

ouroboros1
ouroboros1

Reputation: 14379

With groups in table['b'] sorted, here's one approach:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})

cs = pc.cumulative_sum(table['a'])

gs = table.group_by('b').aggregate([('a', 'sum')])

offset_sum = pa.concat_arrays([
    pa.array([0]),
    gs['a_sum'].chunks[0][:-1]
    ])

a_cumsum = (pc.subtract(cs, 
                        table.join(
                            pa.table({'b': gs['b'], 
                                      'offset_sum': offset_sum}), 
                            'b')
                        ['offset_sum']
                        )
            )

out = pa.table({'a_cumsum': a_cumsum, 'b': table['b']})

Output:

out.to_pandas()

   a_cumsum  b
0         1  x
1         3  x
2         6  x
3         4  y
4         9  y
5        15  y

Explanation / Intermediates

gs

pyarrow.Table
b: string
a_sum: int64
----
b: [["x","y"]]
a_sum: [[6,15]]
  • We want to shift this result, turning [[6,15] into [[0,6]], so that we can subtract these values per group (i.e. cumsum 'y'-values - 6). For this we use pa.concat_arrays.
  • Now, we can map offset_sum via table.join and apply pc.subtract.

On a very small table, you will probably be quicker with a for loop. But this will be much faster on a larger set:

import numpy as np
import pyarrow as pa

N = 100_000

a = np.random.default_rng(0).integers(low=0, high=10, size=N)
b = np.repeat(range(int(N/20)), 20)

table = pa.table({'a': a, 'b': b})

Performance comparison:

  • obtaining a_cumsum as per this answer
  • obtaining cumsum_list in a loop (see answer by @Kelo)
# this answer
5.94 ms ± 303 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# answer @Kelo
579 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Upvotes: 1

Kelo
Kelo

Reputation: 169

You can achieve what you want, by processing each group in b using the following code:

import pyarrow as pa
import pyarrow.compute as pc

table = pa.table({'a': [1, 2, 3, 4, 5, 6], 'b': ['x']*3 + ['y']*3})

unique_b = pc.unique(table['b'])

cumsum_list = []
b_list = []

for value in unique_b:
    mask = pc.equal(table['b'], value)
    group = table.filter(mask)
    cumsum = pc.cumulative_sum(group['a'])
    cumsum_list.extend(cumsum)
    b_list.extend(group['b'])

final_result = pa.table({'a': cumsum_list, 'b': b_list})

To visualize the result you can convert it back to pandas using:

print(final_result.to_pandas())

which returns the following:

enter image description here

Upvotes: 0

Related Questions