Reputation: 437
I'm working on an NLP model at the moment and am currently optimizing the pre-processing steps. Since I'm using a custom Python function, Polars cannot parallelize the operation.
I've tried a few things with Polars' "replace_all" and some ".when.then.otherwise" expressions, but have not found a solution yet.
In this case I am expanding contractions (e.g. "i'm" -> "i am").
I currently use this:
import re
import polars as pl

# These are only a few of the example contractions that I use.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
}

c_re = re.compile("(%s)" % "|".join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)

df = pl.DataFrame({"Text": ["i'm i've, isn't"]})
df["Text"].map_elements(expandContractions)
Outputs
shape: (1, 1)
┌─────────────────────┐
│ Text │
│ --- │
│ str │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘
But I would like to get the full performance benefits of Polars, because the datasets I process are quite large.
Performance test:
import functools

# This dict has 100+ key/value pairs in my test case.
cList = {
    "i'm": "i am",
    "i've": "i have",
    "isn't": "is not",
}
def base_case(sr: pl.Series) -> pl.Series:
    c_re = re.compile("(%s)" % "|".join(cList.keys()))

    def expandContractions(text, c_re=c_re):
        def replace(match):
            return cList[match.group(0)]
        return c_re.sub(replace, text)

    return sr.map_elements(expandContractions)
def loop_case(sr: pl.Series) -> pl.Series:
    for old, new in cList.items():
        sr = sr.str.replace_all(old, new, literal=True)
    return sr
def iter_case(sr: pl.Series) -> pl.Series:
    return functools.reduce(
        lambda res, x: res.str.replace_all(x[0], x[1], literal=True),
        cList.items(),
        sr,
    )
They all return equal results. Here are the average times for 15 loops over ~10,000 samples, each ~500 characters long:
Base case: 16.112362766265868
Loop case: 7.028670716285705
Iter case: 7.112465214729309
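For reference, a minimal harness along these lines (hypothetical; the sample text and sizes are stand-ins for my real data) looks like:
import time

# Hypothetical benchmark: ~10,000 samples of ~500 characters,
# averaged over 15 runs per case.
sr = pl.Series("Text", ["i'm sure i've heard that isn't true. " * 13] * 10_000)

for case in (base_case, loop_case, iter_case):
    start = time.perf_counter()
    for _ in range(15):
        case(sr)
    print(case.__name__, (time.perf_counter() - start) / 15)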
So either method is more than twice as fast, mostly thanks to the Polars API call "replace_all". I ended up using the loop case, since that way I have one less module to import. See this question, answered by jqurious.
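In case it helps others: the same loop can also build a single expression rather than operate on a Series directly (a sketch reusing the cList above), so it composes with with_columns and lazy pipelines:
# Sketch: chain replace_all calls on an expression instead of a Series.
expr = pl.col("Text")
for old, new in cList.items():
    expr = expr.str.replace_all(old, new, literal=True)

df = df.with_columns(expr)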
Upvotes: 3
Views: 812
Reputation: 21544
.str.replace_many() (for non-regex replacements) has since been added to Polars.
df.with_columns(
    pl.col.Text.str.replace_many(
        ["i'm", "i've", "isn't"],
        ["i am", "i have", "is not"],
    )
)
shape: (1, 1)
┌─────────────────────┐
│ Text │
│ --- │
│ str │
╞═════════════════════╡
│ i am i have, is not │
└─────────────────────┘
It currently requires passing "old" and "new" separately, but will also accept a dictionary in the future.
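In the meantime, a dict such as cList from the question can simply be split into parallel lists (a sketch):
df.with_columns(
    pl.col("Text").str.replace_many(
        list(cList.keys()),    # "old" strings
        list(cList.values()),  # "new" replacements
    )
)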
Upvotes: 1
Reputation: 10541
How about

(
    df["Text"]
    .str.replace_all("i'm", "i am", literal=True)
    .str.replace_all("i've", "i have", literal=True)
    .str.replace_all("isn't", "is not", literal=True)
)

?
or:

import functools

functools.reduce(
    lambda res, x: res.str.replace_all(x[0], x[1], literal=True),
    cList.items(),
    df["Text"],
)
Upvotes: 1