Reputation: 91
I have a list of substrings that I want to replace with ' '. What is the fastest way to do so? Is this possible with Cython? My current attempts are really slow when applied to 1 million rows, so the fastest execution is what I'm looking for.
Example:
import re
import pandas as pd

df = pd.DataFrame({"text": [
    "first text to replace",
    "second text to replace",
    "test this string",
    "this is not the first string",
    "short string test",
]})
removal_list = ["text to replace", "this string"]
Some attempts:
def replace_str(df, col, removal_list):
    for item in removal_list:
        df[col] = df[col].str.replace(item, ' ')
    return df

replace_str(df, 'text', removal_list)
def replace_text(text):
    # One compiled pattern per substring, applied in turn
    miscdict_comp = {re.compile(a): ' ' for a in removal_list}
    for pattern, replacement in miscdict_comp.items():
        text = pattern.sub(replacement, text)
    return text

df['text'] = df['text'].apply(replace_text)
Upvotes: 1
Views: 734
Reputation: 261830
This seems like a simple use of str.replace with a single alternation pattern:
reg = '|'.join(removal_list)
df['text'].str.replace(reg, '', regex=True)
output:
0 first
1 second
2 test
3 this is not the first string
4 short string test
Name: text, dtype: object
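If the substrings may contain regex metacharacters (".", "(", "+" and so on), joining them directly with "|" would change what the pattern matches. A minimal sketch of a safer variant, assuming the entries of removal_list are meant as literal text: escape each one with re.escape and put longer entries first, so a longer entry is preferred over a shorter one that is its prefix. It uses ' ' as the replacement, as asked in the question; the names pattern and result are only illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": [
    "first text to replace",
    "second text to replace",
    "test this string",
    "this is not the first string",
    "short string test",
]})
removal_list = ["text to replace", "this string"]

# Escape each substring so it is matched literally, longest first so that
# a longer entry is tried before a shorter one sharing the same prefix.
pattern = "|".join(
    re.escape(s) for s in sorted(removal_list, key=len, reverse=True)
)

# One vectorized pass over the column with the combined pattern.
result = df["text"].str.replace(pattern, " ", regex=True)
print(result.tolist())
```

Rows that contain none of the substrings come back unchanged.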
This runs quite fast. Here is the benchmark of a test on 1M rows (df = pd.concat([df]*200000), using OP's dataframe):
397 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In comparison:
# replace_str
586 ms ± 9.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# replace_text
1.5 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
NB. I removed the assignment parts for the test so that only the computation is compared. The assignment itself also takes time, so assigning once per substring, as replace_str does, will cost more than a single assignment.
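To rerun the comparison outside IPython, something along these lines should work. It is only a sketch: the helper names single_regex and loop_replace are made up for illustration, the frame is kept at 10,000 rows so it finishes quickly (raise the factor to 200000 for the 1M-row case), and the timings will differ by machine and pandas version.

```python
import timeit
import pandas as pd

base = pd.DataFrame({"text": [
    "first text to replace",
    "second text to replace",
    "test this string",
    "this is not the first string",
    "short string test",
]})
removal_list = ["text to replace", "this string"]

# 5 rows * 2000 = 10,000 rows; use 200000 to reproduce the 1M-row test.
big = pd.concat([base] * 2000, ignore_index=True)

def single_regex(s):
    # One pass with all substrings joined into one alternation.
    return s.str.replace("|".join(removal_list), "", regex=True)

def loop_replace(s):
    # One pass per substring, treated as literal text.
    for item in removal_list:
        s = s.str.replace(item, "", regex=False)
    return s

# Both approaches should agree on this data before timing them.
assert single_regex(big["text"]).equals(loop_replace(big["text"]))

for fn in (single_regex, loop_replace):
    best = min(timeit.repeat(lambda: fn(big["text"]), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms")
```

Neither function assigns back to the frame, so this measures the computation only, as in the figures above.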
Upvotes: 5