Reputation: 11
pandas
df['sentences'] = df['content'].str.split(pattern2)
df['normal_text'] = df['sentences'].apply(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x])
polars
df = df.with_columns(pl.col('content').map_elements(lambda x: re.split(pattern2, x)).alias('sentences'))
df = df.with_columns(pl.col('sentences').map_elements(lambda x: [re.sub(pattern3, ' ', sentence) for sentence in x]).alias('normal_text'))
Is there a more elegant way than this?
Upvotes: 1
Views: 3215
Reputation: 21580
The functionality is available natively in Polars via the .str namespace. Note that .str.split() doesn't support regex, but similar behaviour can be achieved with .str.extract_all() and .str.replace_all().
import polars as pl

df = pl.DataFrame({"content": ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]})
pattern2 = r"HI+"
pattern3 = r"\s"
replacement = ""
df.with_columns(
pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
.alias("sentences")
)
shape: (2, 2)
┌────────────────────────┬────────────────────────────────────┐
│ content ┆ sentences │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════════════════╪════════════════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["o neHI", "tw oHIIIII", "th ree"] │
│ fo urHIIfi veHIIIIs ix ┆ ["fo urHII", "fi veHIIII", "s ix"] │
└────────────────────────┴────────────────────────────────────┘
.list.eval() can then be used to process each element of the list: first strip the trailing delimiter (pattern2), then apply the pattern3 replacement.
df.with_columns(
pl.col("content").str.extract_all(rf".*?({pattern2}|$)")
.list.eval(
pl.element().str.replace_all(pattern2, "")
.str.replace_all(pattern3, replacement)
)
.alias("normal_text")
)
shape: (2, 2)
┌────────────────────────┬─────────────────────────┐
│ content ┆ normal_text │
│ --- ┆ --- │
│ str ┆ list[str] │
╞════════════════════════╪═════════════════════════╡
│ o neHItw oHIIIIIth ree ┆ ["one", "two", "three"] │
│ fo urHIIfi veHIIIIs ix ┆ ["four", "five", "six"] │
└────────────────────────┴─────────────────────────┘
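As a sanity check on the output above, the same two steps can be written in plain Python with the standard re module. This is only a reference sketch and not part of the Polars solution; the helper name normalize is hypothetical, and the patterns are the ones defined earlier in this answer.

```python
import re

pattern2 = r"HI+"   # delimiter between "sentences"
pattern3 = r"\s"    # characters to replace inside each sentence
replacement = ""


def normalize(content: str) -> list[str]:
    """Split on pattern2, then apply the pattern3 replacement to each piece."""
    return [re.sub(pattern3, replacement, part) for part in re.split(pattern2, content)]


rows = ["o neHItw oHIIIIIth ree", "fo urHIIfi veHIIIIs ix"]
result = [normalize(row) for row in rows]
print(result)
```

If the Polars expression and this reference disagree on your own data, the likely culprit is the extraction regex, since .str.extract_all() is standing in for a regex split here.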
A basic timing comparison of the two approaches (native expressions vs. .map_elements()):
N = 2000
df = pl.DataFrame({
"content": [
"o neHItw oHIIIIIth ree" * N,
"fo urHIIfi veHIIIIs ix" * N] * N
})
| Approach | Time |
|---|---|
| .str + .list.eval() | 8.28 s |
| .map_elements() | 29.9 s |
Upvotes: 4