beginner_

Reputation: 7632

pyarrow: fast way to iterate ChunkedArray? (from a table)

The array contains Python objects and is part of a table. I need to perform a calculation element-wise. The calculation returns a list of numbers, which should then become new columns in the table.

I looked at the documentation but don't see any way to iterate over the pyarrow array. Is there a way, or do I have to convert it to a numpy array first? (That is what the documentation example for user-defined functions shows.)

Upvotes: 0

Views: 1303

Answers (1)

amol

Reputation: 1791

You can iterate over a ChunkedArray directly; it supports the iterable protocol:

>>> import pyarrow as pa
>>> a = pa.chunked_array([[1,2,3], [4,5,6]])
>>> for x in a: print(x)
... 
1
2
3
4
5
6

But that's rarely what you want to do, because it's fairly slow: every element is handed to Python as a separate scalar object. Wherever possible, express your algorithm as a combination of compute functions ( https://arrow.apache.org/docs/python/api/compute.html ) applied to the whole array.

The User Defined Functions example converts the pyarrow array to a numpy array only because it wants to use the numpy.gcd function ( https://numpy.org/doc/stable/reference/generated/numpy.gcd.html ), which operates on numpy arrays.

Upvotes: 1
