Reputation: 10563
In pandas I can do:
import pandas as pd
df = pd.DataFrame({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})
df.groupby('b')['a'].cumsum()
0 1
1 3
2 6
3 4
4 9
5 15
Name: a, dtype: int64
How can I get the same result in PyArrow starting from
pa.table({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})
without converting to pandas?
Upvotes: 0
Views: 52
Reputation: 14379
With groups in table['b']
sorted, here's one approach:
import pyarrow as pa
import pyarrow.compute as pc
table = pa.table({'a': [1,2,3,4,5,6], 'b': ['x']*3+['y']*3})
cs = pc.cumulative_sum(table['a'])
gs = table.group_by('b').aggregate([('a', 'sum')])
offset_sum = pa.concat_arrays([
pa.array([0]),
gs['a_sum'].chunks[0][:-1]
])
a_cumsum = (pc.subtract(cs,
table.join(
pa.table({'b': gs['b'],
'offset_sum': offset_sum}),
'b')
['offset_sum']
)
)
out = pa.table({'a_cumsum': a_cumsum, 'b': table['b']})
Output:
out.to_pandas()
a_cumsum b
0 1 x
1 3 x
2 6 x
3 4 y
4 9 y
5 15 y
Explanation / Intermediates
cumsum
for column 'a' with pc.cumulative_sum
.sum
for column 'a' (table.group_by
+ aggregate
).gs
pyarrow.Table
b: string
a_sum: int64
----
b: [["x","y"]]
a_sum: [[6,15]]
[[6,15]
into [[0,6]]
, so that we can subtract these values per group (i.e. cumsum 'y'-values - 6). For this we use pa.concat_arrays
.offset_sum
via table.join
and apply pc.subtract
.On a very small table, you will probably be quicker with a for loop. But this will be much faster on a larger set:
import numpy as np
import pyarrow as pa
N = 100_000
a = np.random.default_rng(0).integers(low=0, high=10, size=N)
b = np.repeat(range(int(N/20)), 20)
table = pa.table({'a': a, 'b': b})
Performance comparison:
# this answer
5.94 ms ± 303 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# answer @Kelo
579 ms ± 21.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Upvotes: 1
Reputation: 169
You can achieve what you want, by processing each group in b
using the following code:
import pyarrow as pa
import pyarrow.compute as pc
table = pa.table({'a': [1, 2, 3, 4, 5, 6], 'b': ['x']*3 + ['y']*3})
unique_b = pc.unique(table['b'])
cumsum_list = []
b_list = []
for value in unique_b:
mask = pc.equal(table['b'], value)
group = table.filter(mask)
cumsum = pc.cumulative_sum(group['a'])
cumsum_list.extend(cumsum)
b_list.extend(group['b'])
final_result = pa.table({'a': cumsum_list, 'b': b_list})
To visualize the result you can convert it back to pandas using:
print(final_result.to_pandas())
which returns the following:
Upvotes: 0