Reputation: 1
I'm building out a library that uses pyarrow for data manipulation, since I found it to be far faster than Pandas, and working with pyarrow directly gave me performance benefits over Polars as well. My problem comes when I try to do any data manipulation via the compute module: as far as I've been able to gather, normal operators don't work with pyarrow scalar or array data.
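As a minimal illustration of what I mean by operators not working (the exact error message may vary by pyarrow version):

import pyarrow as pa

a = pa.scalar(5.0)
b = pa.scalar(1.2)

# pyarrow scalars don't implement the arithmetic operators,
# so this raises a TypeError instead of computing a result
y = a - b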
To give further context, here is a basic Python version of what I'm trying to do, followed by a pyarrow version of the same code and execution timings that show the problem:
# a and b are random non-negative, nonzero floats; x is a static nonzero float.
y = (a - b) / x
This expression was then turned into two functions to test performance, one using normal Python and one using pyarrow:
import pyarrow as pa
import pyarrow.compute as pc

# plain Python version
a = 5.0
b = 1.2
x = 15.4

def test_fun():
    y = (a - b) / x
    return y

# pyarrow version using compute kernels on scalars
ap = pa.scalar(5.0)
bp = pa.scalar(1.2)
xp = pa.scalar(15.4)

def test_fun_pa():
    y = pc.divide(pc.subtract(ap, bp), xp)
    return y
I then used the timeit module to test these two calls over 50,000 iterations to show the performance difference:
from timeit import timeit

iterations = 50000
total_time_py = timeit("test_fun()", number=iterations, globals=globals())
total_time_pa = timeit("test_fun_pa()", number=iterations, globals=globals())
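The timings below were produced with print statements along these lines (the exact format strings are my own reconstruction):

print(f"Total time ran for normal py math function: {total_time_py} seconds, "
      f"mean iteration time: {total_time_py / iterations} seconds.")
print(f"Total time ran for pyarrow math function: {total_time_pa} seconds, "
      f"mean iteration time: {total_time_pa / iterations} seconds.")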
The results I got were:
Total time ran for normal py math function: 0.0026655999245122075 seconds, mean iteration time: 5.331199849024415e-08 seconds.
Total time ran for pyarrow math function: 0.4460362000390887 seconds, mean iteration time: 8.920724000781774e-06 seconds.
This effectively nullifies the gains I got from switching to pyarrow in the first place, and I'm not sure how to get around it. My question is: how can I keep the relatively high performance of plain Python operations while still using pyarrow data for its much faster I/O?
The only thing I've tried so far is checking which pyarrow types, if any, support normal operators, and which other parts of my code would be affected by not being able to use them. In that exploration I haven't found a solution in the docs, elsewhere online, or in my own testing.
Upvotes: 0
Views: 54
Reputation: 1
I actually found an answer by searching through the docs again. Every scalar type has an .as_py() method, for example:
https://arrow.apache.org/docs/python/generated/pyarrow.BinaryScalar.html#pyarrow.BinaryScalar.as_py
Calling this method where the variables are instantiated converts the data to plain Python objects, on which normal Python operators work:
# Assume that the variables are being read in from somewhere as pyarrow scalars
ap = pa.scalar(5.0).as_py()
bp = pa.scalar(1.2).as_py()
xp = pa.scalar(15.4).as_py()

def test_fun_pa():
    y = (ap - bp) / xp
    return y
When I ran this code and profiled it, I got:
Total time ran for pyarrow math function: 0.0025354999816045165 seconds, mean iteration time: 5.070999963209033e-08 seconds.
Total time ran for pyarrow math function: 0.002525599906221032 seconds, mean iteration time: 5.0511998124420645e-08 seconds.
Which is effectively the same performance as the plain Python version. So the approach is: whenever a variable will be used in calculations, convert it to a Python object with .as_py() at the point where it is created.
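As a usage sketch of the same pattern (the file name and column names here are made up), values pulled out of data read with pyarrow get converted at extraction time and can then be used with normal operators:

import pyarrow.parquet as pq

# hypothetical example: read a table with pyarrow's fast I/O,
# then convert individual scalars to Python objects as they are extracted
table = pq.read_table("measurements.parquet")  # made-up file name

a = table.column("a")[0].as_py()  # indexing returns a pyarrow Scalar;
b = table.column("b")[0].as_py()  # .as_py() turns it into a plain float
x = table.column("x")[0].as_py()

y = (a - b) / x  # normal Python operators now work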
Upvotes: 0