How to turn `[1, 5]` into `[1, 2, 3, 4, 5]` in a DataFrame column of list type?

Question

I have a polars dataframe:

pl.DataFrame({'a':[[1,3], [1,5]]})

a
list
[1, 3]
[1, 5]

and I'd like to do some kind of vectorized operation to expand this into:

a
list
[1, 2, 3]
[1, 2, 3, 4, 5]

A solution I've come up with is to split the list into two columns (init and final), combine them with pl.struct(['init', 'final']), and then use apply to build the range.

# valid_codes is a predefined set of allowed codes (defined elsewhere)
def get_valid_codes(struct: dict) -> list:
    # inclusive range from init to final
    code_range = set(range(struct['init'], struct['final'] + 1))
    codes = list(valid_codes & code_range)
    return codes if codes else [0]

This is slow for my dataset (300M rows) and I'm wondering if there's a better way.

Bonus points if you can figure out how to filter out certain (predefined) values from the lists.

user18559875 · Accepted Answer

Let's expand the data so we can show some logic for 'bad codes'.

import polars as pl

df = pl.DataFrame({"a": [[1, 3], [1, 5], [7, 9], [3, 7], [9, 13], [5, 11]]})
print(df)

shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 3]    │
│ [1, 5]    │
│ [7, 9]    │
│ [3, 7]    │
│ [9, 13]   │
│ [5, 11]   │
└───────────┘

We'll use 6 through 10 as 'bad codes' to weed out.

# pl.Config(fmt_table_cell_list_len=10) # increase list repr

bad_codes = [6, 7, 8, 9, 10]

df.with_columns(
    pl.int_ranges(pl.col("a").list.first(), pl.col("a").list.last() + 1)
      .list.set_difference(bad_codes)
      .list.sort() # set_difference does not retain order
      .alias("result")
)

shape: (6, 2)
┌───────────┬─────────────────┐
│ a         ┆ result          │
│ ---       ┆ ---             │
│ list[i64] ┆ list[i64]       │
╞═══════════╪═════════════════╡
│ [1, 3]    ┆ [1, 2, 3]       │
│ [1, 5]    ┆ [1, 2, 3, 4, 5] │
│ [7, 9]    ┆ []              │
│ [3, 7]    ┆ [3, 4, 5]       │
│ [9, 13]   ┆ [11, 12, 13]    │
│ [5, 11]   ┆ [5, 11]         │
└───────────┴─────────────────┘

This approach leaves an empty list [] when every code in the range is a 'bad code'. If you need [0] instead of an empty list, you can use pl.when together with a .list.len() check to replace the empty lists with [0].
