I have a polars dataframe:
pl.DataFrame({'a':[[1,3], [1,5]]})
a
list
[1, 3]
[1, 5]
and I'd like to do some kind of vectorized operation to expand this into:
a
list
[1, 2, 3]
[1, 2, 3, 4, 5]
A solution I've come up with is splitting the array into two columns (init and final), then doing pl.struct(['init', 'final']) followed by apply to get the range.
def get_valid_codes(struct: dict) -> list:
    code_range = set(range(struct['init'], struct['final'] + 1))
    codes = list(set.intersection(valid_codes, code_range))
    return codes if codes else [0]
This is slow for my dataset (300M rows) and I'm wondering if there's a better way.
Bonus points if you can figure out how to filter out certain (predefined) values from the lists.
Upvotes: 3
Views: 681
Let's expand the data so we can show some logic for 'bad codes'.
import polars as pl
df = pl.DataFrame({"a": [[1, 3], [1, 5], [7, 9], [3, 7], [9, 13], [5, 11]]})
print(df)
shape: (6, 1)
┌───────────┐
│ a         │
│ ---       │
│ list[i64] │
╞═══════════╡
│ [1, 3]    │
│ [1, 5]    │
│ [7, 9]    │
│ [3, 7]    │
│ [9, 13]   │
│ [5, 11]   │
└───────────┘
We'll use 6 through 10 as 'bad codes' to weed out.
# pl.Config(fmt_table_cell_list_len=10) # increase list repr
bad_codes = [6, 7, 8, 9, 10]
df.with_columns(
    pl.int_ranges(pl.col("a").list.first(), pl.col("a").list.last() + 1)
    .list.set_difference(bad_codes)
    .list.sort()  # set_difference does not retain order
    .alias("result")
)
shape: (6, 2)
┌───────────┬─────────────────┐
│ a         ┆ result          │
│ ---       ┆ ---             │
│ list[i64] ┆ list[i64]       │
╞═══════════╪═════════════════╡
│ [1, 3]    ┆ [1, 2, 3]       │
│ [1, 5]    ┆ [1, 2, 3, 4, 5] │
│ [7, 9]    ┆ []              │
│ [3, 7]    ┆ [3, 4, 5]       │
│ [9, 13]   ┆ [11, 12, 13]    │
│ [5, 11]   ┆ [5, 11]         │
└───────────┴─────────────────┘
This algorithm leaves an empty list [] when all codes are "bad codes". If you need a [0] instead of an empty list, you can use pl.when and the .list.len() expression to change those to [0].
Upvotes: 3