Reputation: 1
I'm building out a library that uses pyarrow for data manipulation, since I found it to be far faster than Pandas, and working with pyarrow directly gave me performance benefits over Polars as well. My problem comes when I try to do any data manipulation via the compute module: as far as I've been able to gather, normal operators don't work with pyarrow scalar or array data.
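As a minimal illustration of what I mean by operators not working (the exact error message may vary by pyarrow version):

import pyarrow as pa

a = pa.scalar(5.0)
b = pa.scalar(1.2)

# pyarrow scalars don't implement the arithmetic operators,
# so this raises a TypeError instead of computing a result
y = a - b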
To give further context, here is a basic Python version of what I'm trying to do, followed by a pyarrow version of the same code and execution timings that show the problem:
# a and b are random non-negative, nonzero floats; x is a static nonzero float.
y = (a - b) / x
This expression was then turned into two functions to test performance, one using normal Python and one using pyarrow:
import pyarrow as pa
import pyarrow.compute as pc

# plain Python version
a = 5.0
b = 1.2
x = 15.4

def test_fun():
    y = (a - b) / x
    return y

# pyarrow version using compute kernels on scalars
ap = pa.scalar(5.0)
bp = pa.scalar(1.2)
xp = pa.scalar(15.4)

def test_fun_pa():
    y = pc.divide(pc.subtract(ap, bp), xp)
    return y
I then used the timeit module to test these two calls over 50,000 iterations to show the performance difference:
from timeit import timeit

iterations = 50000
total_time_py = timeit("test_fun()", number=iterations, globals=globals())
total_time_pa = timeit("test_fun_pa()", number=iterations, globals=globals())
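The timings below were produced with print statements along these lines (the exact format strings are my own reconstruction):

print(f"Total time ran for normal py math function: {total_time_py} seconds, "
      f"mean iteration time: {total_time_py / iterations} seconds.")
print(f"Total time ran for pyarrow math function: {total_time_pa} seconds, "
      f"mean iteration time: {total_time_pa / iterations} seconds.")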
The results I got were:
Total time ran for normal py math function: 0.0026655999245122075 seconds, mean iteration time: 5.331199849024415e-08 seconds.
Total time ran for pyarrow math function: 0.4460362000390887 seconds, mean iteration time: 8.920724000781774e-06 seconds.
This effectively nullifies the gains I got from switching to pyarrow in the first place, and I'm not sure how to get around it. My question is: how can I keep the relatively high performance of plain Python operations while still using pyarrow data for its much faster I/O?
The only thing I've tried so far is checking which pyarrow types, if any, support normal operators, and which other parts of my code would be affected by not being able to use them. In that exploration I haven't found a solution in the docs, elsewhere online, or in my own testing.
Upvotes: 0
Views: 54
Reputation: 1
I actually found an answer by searching through the docs again. Every scalar type has an .as_py() method, for example:
https://arrow.apache.org/docs/python/generated/pyarrow.BinaryScalar.html#pyarrow.BinaryScalar.as_py
Calling this method where the variables are instantiated converts the data to plain Python objects, on which normal Python operators work:
# Assume that the variables are being read in from somewhere as pyarrow scalars
ap = pa.scalar(5.0).as_py()
bp = pa.scalar(1.2).as_py()
xp = pa.scalar(15.4).as_py()

def test_fun_pa():
    y = (ap - bp) / xp
    return y
When I ran this code and profiled it, I got:
Total time ran for pyarrow math function: 0.0025354999816045165 seconds, mean iteration time: 5.070999963209033e-08 seconds.
Total time ran for pyarrow math function: 0.002525599906221032 seconds, mean iteration time: 5.0511998124420645e-08 seconds.
Which is effectively the same performance as the plain Python version. So the approach is: whenever a variable will be used in calculations, convert it to a Python object with .as_py() at the point where it is created.
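As a usage sketch of the same pattern (the file name and column names here are made up), values pulled out of data read with pyarrow get converted at extraction time and can then be used with normal operators:

import pyarrow.parquet as pq

# hypothetical example: read a table with pyarrow's fast I/O,
# then convert individual scalars to Python objects as they are extracted
table = pq.read_table("measurements.parquet")  # made-up file name

a = table.column("a")[0].as_py()  # indexing returns a pyarrow Scalar;
b = table.column("b")[0].as_py()  # .as_py() turns it into a plain float
x = table.column("x")[0].as_py()

y = (a - b) / x  # normal Python operators now work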
Upvotes: 0