Reputation: 51
I'm trying a function in polars, and it is significantly slower than my pandas equivalent.
My pandas function is the following:
import pandas as pd
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pd.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df_filtered = df.loc[(df[0] - target_value).abs() == (df[0] - target_value).abs().min()]
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg pandas run: {sum(run_times)/len(run_times)}")
and my polars equivalent is the following:
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.perf_counter()
    df = df.with_columns(abs_diff = (pl.col('column_0') - target_value).abs())
    df_filtered = df.filter(pl.col('abs_diff') == df['abs_diff'].min())
    run_time = time.perf_counter() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
My real datasets have 1,000 to 10,000 rows and 100 columns, and I need to run this filter over many different datasets. For one example with a df of shape (1_000, 100), my pandas version is several times faster (0.0006s for pandas vs 0.0037s for polars), which was unexpected. Is there a more efficient way to write my polars query? Or is it simply expected that pandas outperforms polars on datasets of this size?
One thing to note: when I test with 2 columns, polars is faster, and the more columns I add, the slower polars gets. On the other hand, polars begins to outperform pandas at roughly 500_000 rows with 100 columns.
Additionally, in my real use case I need to return every row that matches the closest value, not just one.
Not sure if this is important, but for additional context, I'm running python on a linux server.
Upvotes: 3
Views: 169
Reputation: 51
Per Polars Support:
First, you need way more iterations than just 100 for such a small time window. With 10,000 iterations I get the following:
avg polars run: 0.0005123567976988852
avg pandas run: 0.00012923809615895151
But we can rewrite the polars query to be more efficient:
df_filtered = (
    df.lazy()
    .with_columns(abs_diff = (pl.col.column_0 - target_value).abs())
    .filter(pl.col.abs_diff == pl.col.abs_diff.min())
    .collect()
)
Then we get:
avg polars run: 0.00018435594723559915
Ultimately, Polars isn't optimized for doing many tiny, horizontally wide datasets, though.
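For anyone reproducing this, one way to run that many iterations without a hand-rolled loop is the standard-library timeit module. Below is a rough sketch only: the setup mirrors the question, n_iters is the 10,000 figure quoted above, and the exact numbers will vary by machine.
import timeit
import numpy as np
import polars as pl

target_value = 0.5
df = pl.DataFrame(np.random.rand(1000, 100))

n_iters = 10_000  # same iteration count as quoted above
total = timeit.timeit(
    lambda: (
        df.lazy()
        .with_columns(abs_diff=(pl.col.column_0 - target_value).abs())
        .filter(pl.col.abs_diff == pl.col.abs_diff.min())
        .collect()
    ),
    number=n_iters,
)
print(f"avg polars run: {total / n_iters}")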
Unfortunately, I didn't see much of a performance boost when I tried the version above; the timings seem to be very machine-dependent. I will continue with pandas for this specific use case. Thanks all for looking.
Upvotes: 2
Reputation: 268
You could optimize your polars query a bit, especially by using expressions instead of df["col"]. You can gain even more if you don't mind getting only one row out of the query instead of all rows tying for the minimum.
import polars as pl
import time
import numpy as np
target_value = 0.5
data = np.random.rand(1000,100)
df = pl.DataFrame(data)
run_times = []
for i in range(100):
    st = time.time()
    abs_diff = (pl.col('column_0') - target_value).abs()
    # Option A - keep original behaviour but just better optimized
    # df_filtered = df.filter(abs_diff == abs_diff.min())
    # Option B - only get the row with minimal index instead of filtering
    df_filtered = df.row(df.select(abs_diff.arg_min()).item())
    run_time = time.time() - st
    run_times.append(run_time)
print(f"avg polars run: {sum(run_times)/len(run_times)}")
As others have said, numpy (or jax, etc.) may well be better suited for this kind of work, though.
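For instance, here is a minimal numpy sketch (assuming the data is available as a plain 2-D array with the comparison values in column 0) that keeps every row tying for the smallest absolute difference:
import numpy as np

target_value = 0.5
data = np.random.rand(1000, 100)

# distance of column 0 to the target, then a boolean mask of all tying rows
abs_diff = np.abs(data[:, 0] - target_value)
closest_rows = data[abs_diff == abs_diff.min()]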
Upvotes: 2
Reputation: 832
Testing your "function" with pandas, polars and numpy:
import pandas as pd
import time
import numpy as np
import polars as pl
def test(func, argument):
    run_times = []
    for i in range(100):
        st = time.perf_counter()
        df = func(argument)
        run_time = time.perf_counter() - st
        run_times.append(run_time)
    return np.mean(run_times)

def f_pandas(df):
    min_abs_diff = (df[0] - target_value).abs().min()
    return df.loc[(df[0] - target_value).abs() == min_abs_diff]

def f_pandas_vectorized(df):
    return df.loc[(df[0] - target_value).abs().idxmin()]

def f_polars(df):
    min_abs_diff = (df["column_0"] - target_value).abs().min()
    return df.filter((df["column_0"] - target_value).abs() == min_abs_diff)

def f_numpy(data):
    abs_diff = np.abs(data[:, 0] - target_value)
    min_idx = np.argmin(abs_diff)
    return pd.DataFrame(data[[min_idx]])
target_value = 0.5
data = np.random.rand(100000, 1000)
df = pd.DataFrame(data)
df_pl = pl.DataFrame(data)
print(f"average pandas runtime: {test(f_pandas, df)}")
print(f"average pandas runtime with idxmin(): {test(f_pandas_vectorized, df)}")
print(f"average polars runtime: {test(f_polars, df_pl)}")
print(f"average numpy runtime: {test(f_numpy, data)}")
I got these results running in a Jupyter Notebook on a Linux machine.
average pandas runtime: 0.00989325414002451
average pandas runtime with idxmin(): 0.005005129760029377
average polars runtime: 0.006758741329904296
average numpy runtime: 0.004175669220221607
average pandas runtime: 0.009967705049803044
average pandas runtime with idxmin(): 0.005097740050114225
average polars runtime: 0.006972378070222476
average numpy runtime: 0.004102102290034964
average pandas runtime: 0.010020545769948512
average pandas runtime with idxmin(): 0.004993948210048984
average polars runtime: 0.007027968560159934
average numpy runtime: 0.004024256040174805
You can see that polars is faster than your pandas code, but using vectorized operations like idxmin() in pandas is, at least in this case, better than polars. numpy is often faster for this type of numerical work.
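Note that f_pandas_vectorized and f_numpy above return only a single row; if, as mentioned in the question, every row tying for the closest value is needed, the numpy variant could be adapted roughly like this (f_numpy_all_ties is just an illustrative name, reusing target_value, data and test() from the benchmark above):
def f_numpy_all_ties(data):
    # keep every row whose distance to the target equals the minimum,
    # instead of only the first index found by argmin()
    abs_diff = np.abs(data[:, 0] - target_value)
    return pd.DataFrame(data[abs_diff == abs_diff.min()])

print(f"average numpy (all ties) runtime: {test(f_numpy_all_ties, data)}")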
Upvotes: 3