Reputation: 887
I've got this pandas code:
df['date_col'] = pd.to_datetime(df['date_col'], format='%Y-%m-%d')
row['date_col'] = pd.to_datetime(row['date_col'], format='%Y-%m-%d')
df = df[(df['groupby_col'] == row['groupby_col']) &
(row['date_col'] - df['date_col'] <= timedelta(days = 10)) &
(row['date_col'] - df['date_col'] > timedelta(days = 0))]
row['mean_col]' = df['price_col'].mean()
The name row
comes from the fact that this function was applied by a lambda construct.
I'm subsetting df with 2 types of conditions:
I'm pretty sure that filter
is the correct module to use:
df.filter(condition_1 & condition_2)
but i'm struggling to write the conditions.
In order to embed condition 1 do i have to nest a filter condition or a when
is the correct choice?
How do i translate the timedelta
condition?
How do i replicate the lambda approach?
Upvotes: 4
Views: 4645
Reputation: 117345
It's a bit hard to understand your example without test data. But if I try to create some sample data
import polars as pl
import datetime
df = pl.DataFrame({
"date_col": ['2023-01-01','2023-01-02', '2023-01-03'],
"groupby_col": [1,2,3],
})
row = pl.DataFrame({
"date_col": ['2023-01-07','2023-01-08', '2023-01-25'],
"groupby_col": [1,2,3],
})
df = df.with_columns(pl.col('date_col').str.to_datetime().cast(pl.Date))
row = row.with_columns(pl.col('date_col').str.to_datetime().cast(pl.Date))
then you can filter on multiple conditions by joining two dataframes first and then filtering:
(df
.join(row, on=["groupby_col"])
.filter(
pl.col("date_col_right") - pl.col("date_col") >= datetime.timedelta(days=0),
pl.col("date_col_right") - pl.col("date_col") < datetime.timedelta(days=10),
).drop('date_col_right')
)
shape: (2, 2)
┌────────────┬─────────────┐
│ date_col ┆ groupby_col │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═════════════╡
│ 2023-01-01 ┆ 1 │
│ 2023-01-02 ┆ 2 │
└────────────┴─────────────┘
Upvotes: 2
Reputation: 9799
We start with some example data.
from datetime import date, timedelta
import polars as pl
df = pl.DataFrame({
"group": ["A"] * 3 + ["B"] * 3,
"date": [date(2023, month, 1) for month in range(1, 7)]
})
The dataframe looks as follows.
shape: (6, 2)
┌───────┬────────────┐
│ group ┆ date │
│ --- ┆ --- │
│ str ┆ date │
╞═══════╪════════════╡
│ A ┆ 2023-01-01 │
│ A ┆ 2023-02-01 │
│ A ┆ 2023-03-01 │
│ B ┆ 2023-04-01 │
│ B ┆ 2023-05-01 │
│ B ┆ 2023-06-01 │
└───────┴────────────┘
We can then filter the dataframe using a filter on the group
column and multiple filters on the date
column. Note how the predicates in the pl.DataFrame.filter
context are implicitly joined using &
.
(
df
.filter(
pl.col("group") == "A",
pl.col("date") - date(2023, 1, 1) >= timedelta(days=5),
pl.col("date") - date(2023, 1, 1) <= timedelta(days=50),
)
)
Output.
shape: (1, 2)
┌───────┬────────────┐
│ group ┆ date │
│ --- ┆ --- │
│ str ┆ date │
╞═══════╪════════════╡
│ A ┆ 2023-02-01 │
└───────┴────────────┘
Upvotes: 7