roman

Reputation: 117337

Polars get all possible categories as physical representation

Given a DataFrame with a categorical column:

import polars as pl

df = pl.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b", "b"],
    "value": [1,1,1,6,6,6,6],
})

res = df.with_columns(bucket = pl.col.value.cut([1,3]))
shape: (7, 3)
┌─────┬───────┬───────────┐
│ id  ┆ value ┆ bucket    │
│ --- ┆ ---   ┆ ---       │
│ str ┆ i64   ┆ cat       │
╞═════╪═══════╪═══════════╡
│ a   ┆ 1     ┆ (-inf, 1] │
│ a   ┆ 1     ┆ (-inf, 1] │
│ a   ┆ 1     ┆ (-inf, 1] │
│ b   ┆ 6     ┆ (3, inf]  │
│ b   ┆ 6     ┆ (3, inf]  │
│ b   ┆ 6     ┆ (3, inf]  │
│ b   ┆ 6     ┆ (3, inf]  │
└─────┴───────┴───────────┘

How do I get all potential values of the categorical column? I can get them as strings with pl.Expr.cat.get_categories():

res.select(pl.col.bucket.cat.get_categories())
shape: (3, 1)
┌───────────┐
│ bucket    │
│ ---       │
│ str       │
╞═══════════╡
│ (-inf, 1] │
│ (1, 3]    │
│ (3, inf]  │
└───────────┘

I can also get all existing values in their physical representation with pl.Expr.to_physical():

res.select(pl.col.bucket.to_physical())
shape: (7, 1)
┌────────┐
│ bucket │
│ ---    │
│ u32    │
╞════════╡
│ 0      │
│ 0      │
│ 0      │
│ 2      │
│ 2      │
│ 2      │
│ 2      │
└────────┘

But how can I get all potential values in their physical representation? I'd expect output like:

shape: (3, 1)
┌────────┐
│ bucket │
│ ---    │
│ u32    │
╞════════╡
│ 0      │
│ 1      │
│ 2      │
└────────┘

Or should I just assume that it's always encoded as a range of integers without gaps?

Upvotes: 2

Views: 68

Answers (1)

Hericks

Reputation: 9769

I don't see any direct way. However, you could combine pl.Expr.cat.get_categories and pl.Expr.to_physical: cast the category strings back to the column's categorical dtype, then take the physical representation.

res.select(
    pl.col("bucket").cat.get_categories().cast(res.schema["bucket"]).to_physical()
)
shape: (3, 1)
┌────────┐
│ bucket │
│ ---    │
│ u32    │
╞════════╡
│ 0      │
│ 1      │
│ 2      │
└────────┘

Here, it would be nice to have pl.Expr.meta.dtype implemented, so that accessing res again could be avoided.

Upvotes: 2
