Alessandro Togni
Alessandro Togni

Reputation: 887

Polars filter dataframe with multilple conditions

I've got this pandas code:

df['date_col'] = pd.to_datetime(df['date_col'], format='%Y-%m-%d')
row['date_col'] = pd.to_datetime(row['date_col'], format='%Y-%m-%d')

df = df[(df['groupby_col'] == row['groupby_col']) & 
        (row['date_col'] - df['date_col'] <= timedelta(days = 10)) & 
        (row['date_col'] - df['date_col'] > timedelta(days = 0))]

row['mean_col]' = df['price_col'].mean()

The name row comes from the fact that this function was applied by a lambda construct.

I'm subsetting df with 2 types of conditions:

I'm pretty sure that filter is the correct module to use:

df.filter(condition_1 & condition_2)

but i'm struggling to write the conditions. In order to embed condition 1 do i have to nest a filter condition or a when is the correct choice? How do i translate the timedelta condition? How do i replicate the lambda approach?

Upvotes: 4

Views: 4645

Answers (2)

roman
roman

Reputation: 117345

It's a bit hard to understand your example without test data. But if I try to create some sample data

import polars as pl
import datetime

df = pl.DataFrame({
   "date_col": ['2023-01-01','2023-01-02', '2023-01-03'],
   "groupby_col": [1,2,3],
})

row = pl.DataFrame({
   "date_col": ['2023-01-07','2023-01-08', '2023-01-25'],
   "groupby_col": [1,2,3],
})

df = df.with_columns(pl.col('date_col').str.to_datetime().cast(pl.Date))
row = row.with_columns(pl.col('date_col').str.to_datetime().cast(pl.Date))

then you can filter on multiple conditions by joining two dataframes first and then filtering:

(df
   .join(row, on=["groupby_col"])
   .filter(
       pl.col("date_col_right") - pl.col("date_col") >= datetime.timedelta(days=0),
       pl.col("date_col_right") - pl.col("date_col") < datetime.timedelta(days=10),
   ).drop('date_col_right')
)

shape: (2, 2)
┌────────────┬─────────────┐
│ date_col   ┆ groupby_col │
│ ---        ┆ ---         │
│ date       ┆ i64         │
╞════════════╪═════════════╡
│ 2023-01-01 ┆ 1           │
│ 2023-01-02 ┆ 2           │
└────────────┴─────────────┘

Upvotes: 2

Hericks
Hericks

Reputation: 9799

We start with some example data.

from datetime import date, timedelta

import polars as pl

df = pl.DataFrame({
    "group": ["A"] * 3 + ["B"] * 3,
    "date": [date(2023, month, 1) for month in range(1, 7)]
})

The dataframe looks as follows.

shape: (6, 2)
┌───────┬────────────┐
│ group ┆ date       │
│ ---   ┆ ---        │
│ str   ┆ date       │
╞═══════╪════════════╡
│ A     ┆ 2023-01-01 │
│ A     ┆ 2023-02-01 │
│ A     ┆ 2023-03-01 │
│ B     ┆ 2023-04-01 │
│ B     ┆ 2023-05-01 │
│ B     ┆ 2023-06-01 │
└───────┴────────────┘

We can then filter the dataframe using a filter on the group column and multiple filters on the date column. Note how the predicates in the pl.DataFrame.filter context are implicitly joined using &.

(
    df
    .filter(
        pl.col("group") == "A",
        pl.col("date") - date(2023, 1, 1) >= timedelta(days=5),
        pl.col("date") - date(2023, 1, 1) <= timedelta(days=50),
    )
)

Output.

shape: (1, 2)
┌───────┬────────────┐
│ group ┆ date       │
│ ---   ┆ ---        │
│ str   ┆ date       │
╞═══════╪════════════╡
│ A     ┆ 2023-02-01 │
└───────┴────────────┘

Upvotes: 7

Related Questions