TomNorway

Reputation: 3162

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.

Currently, I am solving it via Python. Is there a more idiomatic way?

import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})

# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique

Equivalent in pandas:

df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]

Upvotes: 3

Views: 1105

Answers (3)

Lars Reimann

Reputation: 11

Taking polars PR#4535 into account, you can now write code that is fairly close to pandas:

df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})

mask = df.select(pl.all().n_unique() < 3)

# Can only happen if df has no columns
if mask.is_empty():
    df_few_unique = pl.DataFrame()
else:
    df_few_unique = df[:, mask.row(0)]

You can omit the check for the empty mask if you know the input DataFrame has at least one column.
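
For example, if the frame is known to have at least one column, the whole thing collapses into a single expression (a minimal sketch of the same idea, reusing the boolean-mask indexing from the snippet above):

df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})

# one-row boolean mask (one entry per column), used directly to index the columns
df_few_unique = df[:, df.select(pl.all().n_unique() < 3).row(0)]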

Upvotes: 1

user18559875

Reputation:

Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.

pl.select( 
    s for s in df
    if s.n_unique() < 3
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ ---  ┆ ---  │
│ i64  ┆ i64  │
╞══════╪══════╡
│ 1    ┆ 3    │
│ 1    ┆ 3    │
│ 2    ┆ 3    │
└──────┴──────┘
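
If the generator expression feels too terse, the same idea can be spelled out with an explicit list comprehension over the frame's Series (a sketch of an equivalent formulation, not taken from the answer above):

# iterate the Series of the frame and keep those with few unique values
few_unique = [s for s in df.get_columns() if s.n_unique() < 3]
df_few_unique = pl.DataFrame(few_unique)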

Upvotes: 3

figs_and_nuts

Reputation: 5763

The selected answer, though syntactically clean, is inefficient. You can do quite a bit better.

Let us first include at least two filters rather than just one.

Problem: Select only those columns where the number of unique values is between 1 and 200.

The thing to consider is that you need a pass over the data no matter what, so reading it in is the first step.

Then, if you do

pl.select(
    [s for s in df
     if s.n_unique() < 200 and s.n_unique() > 1]
)

You are computing the filters in sequence and also keeping the intermediate Series in memory. htop confirms that only one core of the machine is used. The ideal solution is to do it all in parallel.

Let us do a few benchmarks. I am using a 32-core machine; parallelism would reduce the time further on machines with more cores.

Set up the DataFrame:

import polars as pl
import numpy as np
df = pl.DataFrame({f'a_{i}':np.random.choice(['a','b','c','d'], 10000000) for i in range(100)})

This takes up about 20 GiB of RAM, so be careful if you want to replicate it.
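
If you just want to check the pattern without 20 GiB of RAM, a scaled-down frame (my own assumption, not part of the original benchmark) shows the same single-core versus parallel behaviour, only with smaller absolute timings:

# ~100x fewer rows than the benchmark frame above
df_small = pl.DataFrame({f'a_{i}':np.random.choice(['a','b','c','d'], 100_000) for i in range(100)})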

The selected solution (htop confirms that it uses only one core):

%%timeit
_df = pl.select(
    [s for s in df
     if s.n_unique() < 200 and s.n_unique() > 1]
)
output:
18.7 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Let us now run the filters in parallel (htop confirms that all cores are used):

%%timeit
_df = df.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1))
output:
1.35 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

We are still computing n_unique() twice per column in the two .n_unique() calls above. Let us do it with a single call by using is_between (parallel execution, htop confirms):

%%timeit
_df = df.select((pl.all().n_unique().is_between(1,200)))
output:
708 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

By the way, if you don't want to remember APIs like is_between and still don't want to compute n_unique() twice, you can use lazy semantics:

df_lazy = df.lazy()

Now, try the earlier two-filter solution on the lazy frame:

%%timeit
_df = df_lazy.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1)).collect()
output:
718 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
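
One thing to keep in mind: the parallel select calls above return a one-row boolean mask rather than the filtered frame (the generator version returns the columns directly). To turn the mask back into the filtered DataFrame, something along these lines works (a sketch, not part of the benchmarks):

mask = df.select(pl.all().n_unique().is_between(1, 200)).row(0)
df_filtered = df.select([c for c, keep in zip(df.columns, mask) if keep])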

Upvotes: 2
