Reputation: 117337
Given a DataFrame with a categorical column:
import polars as pl
df = pl.DataFrame({
    "id": ["a", "a", "a", "b", "b", "b", "b"],
    "value": [1, 1, 1, 6, 6, 6, 6],
})
res = df.with_columns(bucket = pl.col.value.cut([1,3]))
shape: (7, 3)
┌─────┬───────┬───────────┐
│ id ┆ value ┆ bucket │
│ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ cat │
╞═════╪═══════╪═══════════╡
│ a ┆ 1 ┆ (-inf, 1] │
│ a ┆ 1 ┆ (-inf, 1] │
│ a ┆ 1 ┆ (-inf, 1] │
│ b ┆ 6 ┆ (3, inf] │
│ b ┆ 6 ┆ (3, inf] │
│ b ┆ 6 ┆ (3, inf] │
│ b ┆ 6 ┆ (3, inf] │
└─────┴───────┴───────────┘
How do I get all potential values of the categorical column?
I can get them as strings with pl.Expr.cat.get_categories():
res.select(pl.col.bucket.cat.get_categories())
shape: (3, 1)
┌───────────┐
│ bucket │
│ --- │
│ str │
╞═══════════╡
│ (-inf, 1] │
│ (1, 3] │
│ (3, inf] │
└───────────┘
I can also get all existing values in their physical representation with pl.Expr.to_physical()
res.select(pl.col.bucket.to_physical())
shape: (7, 1)
┌────────┐
│ bucket │
│ --- │
│ u32 │
╞════════╡
│ 0 │
│ 0 │
│ 0 │
│ 2 │
│ 2 │
│ 2 │
│ 2 │
└────────┘
But how can I get all potential values in their physical representation? I'd expect output like:
shape: (3, 1)
┌────────┐
│ bucket │
│ --- │
│ u32 │
╞════════╡
│ 0 │
│ 1 │
│ 2 │
└────────┘
Or should I just assume that it's always encoded as a range of integers without gaps?
Upvotes: 2
Views: 68
Reputation: 9769
I don't see any direct way. However, you could combine pl.Expr.cat.get_categories and pl.Expr.to_physical as follows:
res.select(
    pl.col("bucket").cat.get_categories().cast(res.schema["bucket"]).to_physical()
)
shape: (3, 1)
┌────────┐
│ bucket │
│ --- │
│ u32 │
╞════════╡
│ 0 │
│ 1 │
│ 2 │
└────────┘
Here, it would be nice to have pl.Expr.meta.dtype implemented, so that accessing res again could be avoided.
Upvotes: 2