velochy

Reputation: 444

Softmax with polars Lazy Dataframe

I'm relatively new to polars, and it seems very verbose compared to pandas for what I would consider fairly basic manipulations.

Case in point, the shortest way I could figure out doing a softmax over a lazy dataframe is the following:

import polars as pl

data = pl.DataFrame({'a': [1,2,3,4,5,6,7,8,9,10], 'b':[5,5,5,5,5,5,5,5,5,5], 'c': [10,9,8,7,6,5,4,3,2,1]}).lazy()
cols = ['a','b','c']

data = data.with_columns([pl.col(c).exp().alias(c) for c in cols])  # exp() every column
data = data.with_columns(pl.sum_horizontal(cols).alias('sum'))      # row sum of the exps
data = data.with_columns([(pl.col(c) / pl.col('sum')).alias(c) for c in cols]).drop('sum')

data.collect()

Am I missing something and is there a shorter, more readable way of achieving this?

Upvotes: 2

Views: 65

Answers (1)

jqurious

Reputation: 21580

You would use a multi-column selection, e.g. pl.all(), instead of list comprehensions.

(Or pl.col(cols) for a named "subset" of columns)

df.with_columns(
    pl.all().exp() / pl.sum_horizontal(pl.all().exp())
)
shape: (10, 3)
┌──────────┬──────────┬──────────┐
│ a        ┆ b        ┆ c        │
│ ---      ┆ ---      ┆ ---      │
│ f64      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ 0.000123 ┆ 0.006692 ┆ 0.993185 │
│ 0.000895 ┆ 0.01797  ┆ 0.981135 │
│ 0.006377 ┆ 0.047123 ┆ 0.946499 │
│ 0.04201  ┆ 0.114195 ┆ 0.843795 │
│ 0.211942 ┆ 0.211942 ┆ 0.576117 │
│ 0.576117 ┆ 0.211942 ┆ 0.211942 │
│ 0.843795 ┆ 0.114195 ┆ 0.04201  │
│ 0.946499 ┆ 0.047123 ┆ 0.006377 │
│ 0.981135 ┆ 0.01797  ┆ 0.000895 │
│ 0.993185 ┆ 0.006692 ┆ 0.000123 │
└──────────┴──────────┴──────────┘

With LazyFrames we can use .explain() to inspect the query plan.

plan = df.lazy().with_columns(pl.all().exp() / pl.sum_horizontal(pl.all().exp())).explain()
print(plan)
# simple π 3/7 ["a", "b", "c"]
#    WITH_COLUMNS:
#    [[(col("__POLARS_CSER_0x9b1b3182d015f390")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("a"), [(col("__POLARS_CSER_0xb82f49f764da7a09")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("b"), [(col("__POLARS_CSER_0x1a200912e2bcc700")) / (col("__POLARS_CSER_0x762bfea120ea9e6"))].alias("c")]
#      WITH_COLUMNS:
#      [col("a").exp().alias("__POLARS_CSER_0x9b1b3182d015f390"), col("b").exp().alias("__POLARS_CSER_0xb82f49f764da7a09"), col("c").exp().alias("__POLARS_CSER_0x1a200912e2bcc700"), col("a").exp().sum_horizontal([col("b").exp(), col("c").exp()]).alias("__POLARS_CSER_0x762bfea120ea9e6")]
#       DF ["a", "b", "c"]; PROJECT */3 COLUMNS

Polars' common subexpression elimination caches the duplicated pl.all().exp() expression in temporary __POLARS_CSER* columns for you, so the exponentials are only computed once.


Upvotes: 3
