Reputation: 395
Is it proper to use pl.Expr.map_elements to throw the Python function zfill at my data? I'm not looking for a performant solution.
pl.col("column").map_elements(lambda x: str(x).zfill(5))
Is there a better way to do this?
As a follow-up, I'd love to chat on the Discord about what a good implementation could look like, if you have some insight (assuming one doesn't currently exist).
Upvotes: 2
Views: 465
Reputation:
With version 0.13.43 and later, Polars has a str.zfill expression to accomplish this. str.zfill will be faster than the approach below and should be preferred.
From your question, I'm assuming that you are starting with a column of integers, since your lambda converts each value with str:
lambda x: str(x).zfill(5)
If so, here's an expression-based approach that matches the output of pandas' str.zfill rather strictly:
import polars as pl

df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
z = 5  # target width

df.with_columns(
    # Values already longer than z characters are kept unchanged.
    pl.when(pl.col("num").cast(pl.String).str.len_chars() > z)
    .then(pl.col("num").cast(pl.String))
    # Otherwise prepend z zeros and keep only the last z characters.
    .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("num").cast(pl.String)).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num     ┆ result  │
│ ---     ┆ ---     │
│ i64     ┆ str     │
╞═════════╪═════════╡
│ -10     ┆ 00-10   │
│ -1      ┆ 000-1   │
│ 0       ┆ 00000   │
│ 1       ┆ 00001   │
│ 10      ┆ 00010   │
│ …       ┆ …       │
│ 1000    ┆ 01000   │
│ 10000   ┆ 10000   │
│ 100000  ┆ 100000  │
│ 1000000 ┆ 1000000 │
│ null    ┆ null    │
└─────────┴─────────┘
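To make the logic of the when/then/otherwise expression easier to follow, here is a rough pure-Python sketch of what it does to a single value. The helper name zfill_like_expr is hypothetical and only for illustration; it is not part of Polars or pandas.

```python
def zfill_like_expr(value, width=5):
    """Mirror the pad-then-slice logic of the Polars expression for one value."""
    if value is None:
        return None  # nulls pass through unchanged
    s = str(value)
    if len(s) > width:
        return s  # already wider than the target width: leave untouched
    # Prepend `width` zeros, then keep only the last `width` characters.
    return ("0" * width + s)[-width:]


values = [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]
padded = [zfill_like_expr(v) for v in values]
print(padded)
```

Because the zeros are simply prepended, a negative value such as -10 becomes 00-10. Python's built-in str.zfill behaves differently here: str(-10).zfill(5) returns -0010, with the padding placed after the sign.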
Comparing the output to pandas:
df.with_columns(pl.col('num').cast(pl.String)).get_column('num').to_pandas().str.zfill(z)
0       00-10
1       000-1
2       00000
3       00001
4       00010
5       00100
6       01000
7       10000
8      100000
9     1000000
10       None
dtype: object
If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)
Edit 2: To shave off some time, you can cast once into a temporary column and reuse it:
result = df.lazy().with_columns(
    pl.col('num').cast(pl.String).alias('tmp')
).with_columns(
    pl.when(pl.col("tmp").str.len_chars() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("tmp")).str.slice(-z))
    .alias("result")
).drop('tmp').collect()
but it didn't save that much time.
Upvotes: 3