supersick

Reputation: 351

Iterate through a group_by with a tuple like pandas

When I iterate over a pandas groupby, I get back (key, DataFrame) tuples. This was important because I could do [x for x in df_pandas.sort_values('date').groupby('grouping_column')] and then sort this list of tuples by x[0].

In pandas, the groups are also sorted by key automatically after a groupby.

I did that to get a consistent output order in Plotly (area chart).

Now with polars I can't do the same; I just get the DataFrame back. Is there any way to accomplish the same thing?

I tried adding a sort('date', 'grouping_column'), but it had no effect.

What's in my mind for polars is this:

for value in df['grouping_column'].unique().sort():
    # keep df intact; filter into a separate variable for each group
    sub_df = df.filter(pl.col('grouping_column') == value)
    ...

This does give the desired result, because it always iterates through the groups in the same sequence, whereas the order coming out of group_by seems more or less random.

My problem is that this second solution doesn't seem very efficient.

The other thing I could do is

[(sub_df['some_col'].to_numpy()[0], sub_df) for sub_df in df.group_by('some_col')]

Then I would use Python's sort to order the list by the key x[0] in each tuple and iterate through the list again. However, this solution seems quite ugly as well.

Upvotes: 1

Views: 3771

Answers (1)

user18559875

You can use the partition_by function with as_dict=True to split the DataFrame into a dictionary, where each key is a tuple holding a value of your grouping_column and each value is the DataFrame for that group.

For example, let's say we have this data:

import polars as pl
from datetime import datetime

df = pl.DataFrame({"grouping_column": [1, 2, 3], }).join(
    pl.DataFrame(
        {
            "date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 3, 1), "1mo", eager=True),
        }
    ),
    how="cross",
)
df
shape: (9, 2)
┌─────────────────┬────────────┐
│ grouping_column ┆ date       │
│ ---             ┆ ---        │
│ i64             ┆ date       │
╞═════════════════╪════════════╡
│ 1               ┆ 2020-01-01 │
│ 1               ┆ 2020-02-01 │
│ 1               ┆ 2020-03-01 │
│ 2               ┆ 2020-01-01 │
│ 2               ┆ 2020-02-01 │
│ 2               ┆ 2020-03-01 │
│ 3               ┆ 2020-01-01 │
│ 3               ┆ 2020-02-01 │
│ 3               ┆ 2020-03-01 │
└─────────────────┴────────────┘

We can split the DataFrame into a dictionary.

df.partition_by(by='grouping_column', maintain_order=True, as_dict=True)
{(1,): shape: (3, 2)
 ┌─────────────────┬────────────┐
 │ grouping_column ┆ date       │
 │ ---             ┆ ---        │
 │ i64             ┆ date       │
 ╞═════════════════╪════════════╡
 │ 1               ┆ 2020-01-01 │
 │ 1               ┆ 2020-02-01 │
 │ 1               ┆ 2020-03-01 │
 └─────────────────┴────────────┘,
 (2,): shape: (3, 2)
 ┌─────────────────┬────────────┐
 │ grouping_column ┆ date       │
 │ ---             ┆ ---        │
 │ i64             ┆ date       │
 ╞═════════════════╪════════════╡
 │ 2               ┆ 2020-01-01 │
 │ 2               ┆ 2020-02-01 │
 │ 2               ┆ 2020-03-01 │
 └─────────────────┴────────────┘,
 (3,): shape: (3, 2)
 ┌─────────────────┬────────────┐
 │ grouping_column ┆ date       │
 │ ---             ┆ ---        │
 │ i64             ┆ date       │
 ╞═════════════════╪════════════╡
 │ 3               ┆ 2020-01-01 │
 │ 3               ┆ 2020-02-01 │
 │ 3               ┆ 2020-03-01 │
 └─────────────────┴────────────┘}

From there, you can get the (key, DataFrame) tuples using the dictionary's items method.

for x in df.partition_by(by='grouping_column', maintain_order=True, as_dict=True).items():
    print("next item")
    print(x)
next item
((1,), shape: (3, 2)
┌─────────────────┬────────────┐
│ grouping_column ┆ date       │
│ ---             ┆ ---        │
│ i64             ┆ date       │
╞═════════════════╪════════════╡
│ 1               ┆ 2020-01-01 │
│ 1               ┆ 2020-02-01 │
│ 1               ┆ 2020-03-01 │
└─────────────────┴────────────┘)
next item
((2,), shape: (3, 2)
┌─────────────────┬────────────┐
│ grouping_column ┆ date       │
│ ---             ┆ ---        │
│ i64             ┆ date       │
╞═════════════════╪════════════╡
│ 2               ┆ 2020-01-01 │
│ 2               ┆ 2020-02-01 │
│ 2               ┆ 2020-03-01 │
└─────────────────┴────────────┘)
next item
((3,), shape: (3, 2)
┌─────────────────┬────────────┐
│ grouping_column ┆ date       │
│ ---             ┆ ---        │
│ i64             ┆ date       │
╞═════════════════╪════════════╡
│ 3               ┆ 2020-01-01 │
│ 3               ┆ 2020-02-01 │
│ 3               ┆ 2020-03-01 │
└─────────────────┴────────────┘)

Upvotes: 5
