cnpryer

Reputation: 395

Is there a good way to do `zfill` in polars?

Is it proper to use `pl.Expr.map_elements` to apply the Python `zfill` function to my data? I'm not looking for a performant solution.

pl.col("column").map_elements(lambda x: str(x).zfill(5))

Is there a better way to do this?

As a follow-up, I'd love to chat on Discord about what a good implementation could look like, if you have some insight (assuming one doesn't currently exist).

Upvotes: 2

Views: 465

Answers (1)

user18559875

Reputation:

Edit: Polars 0.13.43 and later

With version 0.13.43 and later, Polars has a `str.zfill` expression for exactly this. It is faster than the answer below and should be preferred.


From your question, I'm assuming that you are starting with a column of integers.

lambda x: str(x).zfill(5)

If so, here's an approach that adheres closely to pandas' `zfill` behavior:

import polars as pl
df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})

z = 5
df.with_columns(
    pl.when(pl.col("num").cast(pl.String).str.len_chars() > z)
    .then(pl.col("num").cast(pl.String))
    .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("num").cast(pl.String)).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num     ┆ result  │
│ ---     ┆ ---     │
│ i64     ┆ str     │
╞═════════╪═════════╡
│ -10     ┆ 00-10   │
│ -1      ┆ 000-1   │
│ 0       ┆ 00000   │
│ 1       ┆ 00001   │
│ 10      ┆ 00010   │
│ …       ┆ …       │
│ 1000    ┆ 01000   │
│ 10000   ┆ 10000   │
│ 100000  ┆ 100000  │
│ 1000000 ┆ 1000000 │
│ null    ┆ null    │
└─────────┴─────────┘

Comparing the output to pandas:

df.with_columns(pl.col('num').cast(pl.String)).get_column('num').to_pandas().str.zfill(z)
0       00-10
1       000-1
2       00000
3       00001
4       00010
5       00100
6       01000
7       10000
8      100000
9     1000000
10       None
dtype: object

If you are starting with strings, then you can simplify the code by getting rid of any calls to `cast`.

Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)

Edit2: To shave off some time, you can use the following:

result = df.lazy().with_columns(
    pl.col('num').cast(pl.String).alias('tmp')
).with_columns(
    pl.when(pl.col("tmp").str.len_chars() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str(pl.lit("0" * z), pl.col("tmp")).str.slice(-z))
    .alias("result")
).drop('tmp').collect()

but it didn't save that much time.

Upvotes: 3
