Hui LI
Hui LI

Reputation: 41

Polars: multiple levels for certain columns

I have a polars dataframe like this:

Users 2021-04-27 2021-04-28 2021-04-29 2021-04-30 ...
A 2 2 5 6 ...

how can I do to have multi level for columns date and to get a excel file like this ? expected format

I already tried pandas.MultiIndex.

Upvotes: 4

Views: 1074

Answers (1)

jqurious
jqurious

Reputation: 21580

Pandas and Polars both use xlsxwriter to write excel files.

(There are some nice examples in the docs: https://xlsxwriter.readthedocs.io/working_with_polars.html)

To combine cells , pandas uses xlsxwriter's .merge_range() functionality.

It does not look like Polars currently offers a way to use this, it seems like you would have to discuss it as a feature request.


Some of the particular details of your example have been hardcoded, but the general idea could look something like this:

As @Hericks mentioned, in Polars you would usually reshape from wide-to-long e.g. with .unpivot()

import xlsxwriter
import polars as pl

df = pl.from_repr("""
┌───────┬────────────┬────────────┬────────────┬────────────┬────────────┐
│ Users ┆ 2021-04-27 ┆ 2021-04-28 ┆ 2021-04-29 ┆ 2021-04-30 ┆ 2021-03-01 │
│ ---   ┆ ---        ┆ ---        ┆ ---        ┆ ---        ┆ ---        │
│ str   ┆ i64        ┆ i64        ┆ i64        ┆ i64        ┆ i64        │
╞═══════╪════════════╪════════════╪════════════╪════════════╪════════════╡
│ A     ┆ 2          ┆ 2          ┆ 5          ┆ 6          ┆ 1          │
└───────┴────────────┴────────────┴────────────┴────────────┴────────────┘
""")

df_long = (
   df.unpivot(index="Users", variable_name="date")
     .with_columns(
        ym = pl.col("date").str.to_date().dt.to_string("%Y-%m"),
        d  = pl.col("date").str.to_date().dt.to_string("%d"),
     )
     .sort("ym") # to order columns (should be a proper "date sort" instead)
)
┌───────┬────────────┬───────┬─────────┬─────┐
│ Users ┆ date       ┆ value ┆ ym      ┆ d   │
│ ---   ┆ ---        ┆ ---   ┆ ---     ┆ --- │
│ str   ┆ str        ┆ i64   ┆ str     ┆ str │
╞═══════╪════════════╪═══════╪═════════╪═════╡
│ A     ┆ 2021-03-01 ┆ 1     ┆ 2021-03 ┆ 01  │
│ A     ┆ 2021-04-27 ┆ 2     ┆ 2021-04 ┆ 27  │
│ A     ┆ 2021-04-28 ┆ 2     ┆ 2021-04 ┆ 28  │
│ A     ┆ 2021-04-29 ┆ 5     ┆ 2021-04 ┆ 29  │
│ A     ┆ 2021-04-30 ┆ 6     ┆ 2021-04 ┆ 30  │
└───────┴────────────┴───────┴─────────┴─────┘
levels = df_long.group_by("ym", maintain_order=True).agg("d")
shape: (2, 2)
┌─────────┬──────────────────────┐
│ ym      ┆ d                    │
│ ---     ┆ ---                  │
│ str     ┆ list[str]            │
╞═════════╪══════════════════════╡
│ 2021-03 ┆ ["01"]               │
│ 2021-04 ┆ ["27", "28", … "30"] │
└─────────┴──────────────────────┘

Essentially, it just keeps track of the current offset/widths and any multi-width entries are written using .merge_range() instead of .write()

with xlsxwriter.Workbook("polars-multi-level.xlsx") as book:
    sheet = book.add_worksheet()

    cell_format = book.add_format(dict(
        bold = 1,
        border = 1,
        align = "center",
        valign = "vcenter",
    ))

    sheet.merge_range("A1:A2", "Users", cell_format)

    levels = df_long.group_by("ym", maintain_order=True).agg("d")

    row = 0
    offset = 1 # col 0 already contains "Users"
    for idx, (top_level, sub_levels) in enumerate(levels.iter_rows()):
        data = top_level
        first_col = offset + idx
        width = len(sub_levels)

        if width > 1:
            last_col  = offset + width
            sheet.merge_range(
               *[row, first_col, row, last_col, data, cell_format]
            )
            for sub_idx, sub_level in enumerate(sub_levels):
                sheet.write(row + 1, first_col + sub_idx, sub_level, cell_format)
            offset += width
        else:
            sheet.write(row, first_col, data, cell_format)
            sheet.write(row + 1, first_col, sub_levels[0], cell_format)

enter image description here

Upvotes: 4

Related Questions