Reputation: 51
I'm trying a function in polars, and it is significantly slower than my pandas equivalent.
My pandas function is the following:
import pandas as pd
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pd.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df_filtered = df.loc[(df[0] - target_value).abs() == (df[0] - target_value).abs().min()]
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg pandas run: {sum(run_times)/len(run_times)}")
and my polars equivalent is the following:
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df = df.with_columns(abs_diff = (pl.col('column_0') - target_value).abs())
    df_filtered = df.filter(pl.col('abs_diff') == df['abs_diff'].min())
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
My real datasets have 1,000 to 10,000 rows and 100 columns, and I need to run this filter over many different datasets. For one example with a df of shape (1_000, 100), my pandas version is several times faster (0.0006s for pandas vs 0.0037s for polars), which was unexpected. Is there a more efficient way to write my polars query? Or is it simply expected that pandas outperforms polars on datasets of this size?
One thing to note: when I test with 2 columns, polars is faster, and the more columns I add, the slower polars gets. On the other hand, polars begins to outperform pandas at roughly 500_000 rows with 100 columns.
Additionally, in my real use case I need to return every row that matches the closest value, not just one.
Not sure if this is important, but for additional context, I'm running python on a linux server.
Upvotes: 3
Views: 169
Reputation: 51
Per Polars Support:
First, you need way more iterations than just 100 for such a small time window. With 10,000 iterations I get the following:
avg polars run: 0.0005123567976988852
avg pandas run: 0.00012923809615895151
But we can rewrite the polars query to be more efficient:
df_filtered = (
    df.lazy()
    .with_columns(abs_diff = (pl.col.column_0 - target_value).abs())
    .filter(pl.col.abs_diff == pl.col.abs_diff.min())
    .collect()
)
Then we get:
avg polars run: 0.00018435594723559915
Ultimately, Polars isn't optimized for doing many tiny, horizontally wide datasets, though.
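For anyone reproducing this, one way to run that many iterations without a hand-rolled loop is the standard-library timeit module. Below is a rough sketch only: the setup mirrors the question, n_iters is the 10,000 figure quoted above, and the exact numbers will vary by machine.
import timeit
import numpy as np
import polars as pl

target_value = 0.5
df = pl.DataFrame(np.random.rand(1000, 100))

n_iters = 10_000  # same iteration count as quoted above
total = timeit.timeit(
    lambda: (
        df.lazy()
        .with_columns(abs_diff=(pl.col.column_0 - target_value).abs())
        .filter(pl.col.abs_diff == pl.col.abs_diff.min())
        .collect()
    ),
    number=n_iters,
)
print(f"avg polars run: {total / n_iters}")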
Unfortunately, I didn't see much of a performance boost when I tried the version above; the timings seem to be very machine-dependent. I will continue with pandas for this specific use case. Thanks all for looking.
Upvotes: 2
Reputation: 268
You could optimize your polars query a bit, especially by using expressions instead of df["col"]. You can gain even more if you don't mind getting only one row out of the query instead of all rows tying for the minimum.
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.time()
    abs_diff = (pl.col('column_0') - target_value).abs()
    # Option A - keep original behaviour but just better optimized
    # df_filtered = df.filter(abs_diff == abs_diff.min())
    # Option B - only get the row with minimal index instead of filtering
    df_filtered = df.row(df.select(abs_diff.arg_min()).item())
    run_time = time.time() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
As others have said, numpy (or jax, etc.) may well be better suited for this kind of work, though.
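For instance, here is a minimal numpy sketch (assuming the data is available as a plain 2-D array with the comparison values in column 0) that keeps every row tying for the smallest absolute difference:
import numpy as np

target_value = 0.5
data = np.random.rand(1000, 100)

# distance of column 0 to the target, then a boolean mask of all tying rows
abs_diff = np.abs(data[:, 0] - target_value)
closest_rows = data[abs_diff == abs_diff.min()]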
Upvotes: 2
Reputation: 832
Testing your "function" with pandas, polars and numpy:
import pandas as pd
import time
import numpy as np
import polars as pl
def test(func, argument):
    run_times = []
    for i in range(100):
        st = time.perf_counter()
        df = func(argument)
        run_time = time.perf_counter() - st
        run_times.append(run_time)
    return np.mean(run_times)

def f_pandas(df):
    min_abs_diff = (df[0] - target_value).abs().min()
    return df.loc[(df[0] - target_value).abs() == min_abs_diff]

def f_pandas_vectorized(df):
    return df.loc[(df[0] - target_value).abs().idxmin()]

def f_polars(df):
    min_abs_diff = (df["column_0"] - target_value).abs().min()
    return df.filter((df["column_0"] - target_value).abs() == min_abs_diff)

def f_numpy(data):
    abs_diff = np.abs(data[:, 0] - target_value)
    min_idx = np.argmin(abs_diff)
    return pd.DataFrame(data[[min_idx]])
target_value = 0.5
data = np.random.rand(100000, 1000)
df = pd.DataFrame(data)
df_pl = pl.DataFrame(data)
print(f"average pandas runtime: {test(f_pandas, df)}")
print(f"average pandas runtime with idxmin(): {test(f_pandas_vectorized, df)}")
print(f"average polars runtime: {test(f_polars, df_pl)}")
print(f"average numpy runtime: {test(f_numpy, data)}")
I got these results running in a Jupyter Notebook on a Linux machine.
average pandas runtime: 0.00989325414002451
average pandas runtime with idxmin(): 0.005005129760029377
average polars runtime: 0.006758741329904296
average numpy runtime: 0.004175669220221607
average pandas runtime: 0.009967705049803044
average pandas runtime with idxmin(): 0.005097740050114225
average polars runtime: 0.006972378070222476
average numpy runtime: 0.004102102290034964
average pandas runtime: 0.010020545769948512
average pandas runtime with idxmin(): 0.004993948210048984
average polars runtime: 0.007027968560159934
average numpy runtime: 0.004024256040174805
You can see that polars is faster than your pandas code, but using vectorized operations like idxmin() in pandas is, at least in this case, better than polars. numpy is often faster for this type of numerical work.
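Note that f_pandas_vectorized and f_numpy above return only a single row; if, as mentioned in the question, every row tying for the closest value is needed, the numpy variant could be adapted roughly like this (f_numpy_all_ties is just an illustrative name, reusing target_value, data and test() from the benchmark above):
def f_numpy_all_ties(data):
    # keep every row whose distance to the target equals the minimum,
    # instead of only the first index found by argmin()
    abs_diff = np.abs(data[:, 0] - target_value)
    return pd.DataFrame(data[abs_diff == abs_diff.min()])

print(f"average numpy (all ties) runtime: {test(f_numpy_all_ties, data)}")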
Upvotes: 3