dh Lin

Reputation: 11

Polars: applying a lambda with a list comprehension like in pandas. Is there a better way?

pandas

df['sentences'] = df['content'].str.split(pattern2)
df['normal_text'] = df['sentences'].apply(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x])

polars

df = df.with_columns(pl.col('content').map_elements(lambda x: re.split(pattern2, x)).alias('sentences'))
df = df.with_columns(pl.col('sentences').map_elements(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x]).alias('normal_text'))

Is there a more elegant way than this?

Upvotes: 1

Views: 3215

Answers (1)

jqurious

Reputation: 21580

The functionality is available natively in Polars via the .str namespace.

.str.split() doesn't support regex.

But similar behaviour can be achieved with .str.extract_all() and .str.replace_all().

import polars as pl

df = pl.DataFrame({"content": ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]})

pattern2 = r"HI+"
pattern3 = r"\s"

replacement = ""
df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .alias("sentences")
)
shape: (2, 2)
┌────────────────────────┬────────────────────────────────────┐
│ content                ┆ sentences                          │
│ ---                    ┆ ---                                │
│ str                    ┆ list[str]                          │
╞════════════════════════╪════════════════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o neHI", "tw oHIIIII", "th ree"] │
│ fo urHIIfi veHIIIIs ix ┆ ["fo urHII", "fi veHIIII", "s ix"] │
└────────────────────────┴────────────────────────────────────┘

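Each match still ends with its pattern2 delimiter, so .list.eval() can then be used to process every list element: strip the delimiter and apply the pattern3 replacement to get the desired result.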

df.with_columns(
   pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
     .list.eval(
        pl.element().str.replace_all(pattern2, "")
                    .str.replace_all(pattern3, replacement)
     )
     .alias("normal_text")
)
shape: (2, 2)
┌────────────────────────┬─────────────────────────┐
│ content                ┆ normal_text             │
│ ---                    ┆ ---                     │
│ str                    ┆ list[str]               │
╞════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["four", "five", "six"] │
└────────────────────────┴─────────────────────────┘
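For checking the result, a plain-Python reference of the same transformation can be handy. This is only a sketch built on the standard `re` module, mirroring the split-then-substitute logic from the question; the helper name `normalize_reference` and its parameter names are made up for illustration and are not part of Polars.

```python
import re

pattern2 = r"HI+"   # delimiter between "sentences"
pattern3 = r"\s"    # characters to replace inside each sentence
replacement = ""


def normalize_reference(content, split_pattern=pattern2,
                        strip_pattern=pattern3, repl=replacement):
    """Hypothetical helper: split on a regex, then substitute in each piece."""
    return [re.sub(strip_pattern, repl, piece)
            for piece in re.split(split_pattern, content)]


rows = ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]
reference = [normalize_reference(row) for row in rows]
print(reference)
```

The `reference` list can be compared with the `normal_text` column produced by the Polars expression (for example after calling `.to_list()` on it) to confirm both routes agree on the sample data.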

Performance

A basic timing comparison of both approaches on a larger input:

N = 2000
df = pl.DataFrame({
   "content": [
      "o neHItw oHIIIIIth ree" * N, 
      "fo urHIIfi veHIIIIs ix" * N] * N
})
Name                   Time
.str + .list.eval()    8.28s
.map_elements()        29.9s

Upvotes: 4
