Reputation: 2617
Is there an equivalent way to to df.groupby().shift in polars? Use pandas.shift() within a group
Upvotes: 10
Views: 6577
Reputation:
You can use the over
expression to accomplish this in Polars. Using the example from the link...
import polars as pl
df = pl.DataFrame({
"a": ["a", "a", "b", "b", "b"],
"b": [1, 2, 3, 5, 3],
"c": [5, 4, 3, 2, 1],
})
df.with_columns(
pl.col("b").shift().over("a").alias("prev_value")
)
shape: (5, 4)
┌─────┬─────┬─────┬────────────┐
│ a ┆ b ┆ c ┆ prev_value │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪════════════╡
│ a ┆ 1 ┆ 5 ┆ null │
│ a ┆ 2 ┆ 4 ┆ 1 │
│ b ┆ 3 ┆ 3 ┆ null │
│ b ┆ 5 ┆ 2 ┆ 3 │
│ b ┆ 3 ┆ 1 ┆ 5 │
└─────┴─────┴─────┴────────────┘
To perform this on more than one column, you can specify the columns in the pl.col
expression, and then use a prefix/suffix to name the new columns. For example:
df.with_columns(
pl.col("b", "c").shift().over("a").name.prefix("prev_")
)
shape: (5, 5)
┌─────┬─────┬─────┬────────┬────────┐
│ a ┆ b ┆ c ┆ prev_b ┆ prev_c │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪════════╪════════╡
│ a ┆ 1 ┆ 5 ┆ null ┆ null │
│ a ┆ 2 ┆ 4 ┆ 1 ┆ 5 │
│ b ┆ 3 ┆ 3 ┆ null ┆ null │
│ b ┆ 5 ┆ 2 ┆ 3 ┆ 3 │
│ b ┆ 3 ┆ 1 ┆ 5 ┆ 2 │
└─────┴─────┴─────┴────────┴────────┘
over
Let"s use this data.
df = pl.DataFrame(
{
"id": [1] * 5 + [2] * 5,
"date": ["2020-01-01", "2020-01-01", "2020-02-01", "2020-02-01", "2020-02-01"] * 2,
"value1": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"value2": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
}
).with_columns(pl.col("date").str.to_date())
df
shape: (10, 4)
┌─────┬────────────┬────────┬────────┐
│ id ┆ date ┆ value1 ┆ value2 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 │
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 │
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 │
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 │
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 │
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 │
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 │
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 │
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 │
└─────┴────────────┴────────┴────────┘
We can place a list of our grouping variables in the over
expression (as well as a list in our pl.col
expression). Polars will run them all in parallel.
df.with_columns(
pl.col("value1", "value2").shift().over("id", "date").name.prefix("prev_"),
pl.col("value1", "value2").diff().over("id", "date").name.suffix("_diff")
)
shape: (10, 8)
┌─────┬────────────┬────────┬────────┬─────────────┬─────────────┬─────────────┬─────────────┐
│ id ┆ date ┆ value1 ┆ value2 ┆ prev_value1 ┆ prev_value2 ┆ value1_diff ┆ value2_diff │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪════════════╪════════╪════════╪═════════════╪═════════════╪═════════════╪═════════════╡
│ 1 ┆ 2020-01-01 ┆ 1 ┆ 10 ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ 2020-01-01 ┆ 2 ┆ 20 ┆ 1 ┆ 10 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-02-01 ┆ 3 ┆ 30 ┆ null ┆ null ┆ null ┆ null │
│ 1 ┆ 2020-02-01 ┆ 4 ┆ 40 ┆ 3 ┆ 30 ┆ 1 ┆ 10 │
│ 1 ┆ 2020-02-01 ┆ 5 ┆ 50 ┆ 4 ┆ 40 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-01-01 ┆ 6 ┆ 60 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 2020-01-01 ┆ 7 ┆ 70 ┆ 6 ┆ 60 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-02-01 ┆ 8 ┆ 80 ┆ null ┆ null ┆ null ┆ null │
│ 2 ┆ 2020-02-01 ┆ 9 ┆ 90 ┆ 8 ┆ 80 ┆ 1 ┆ 10 │
│ 2 ┆ 2020-02-01 ┆ 10 ┆ 100 ┆ 9 ┆ 90 ┆ 1 ┆ 10 │
└─────┴────────────┴────────┴────────┴─────────────┴─────────────┴─────────────┴─────────────┘
Upvotes: 24