enchanted_potato

Reputation: 91

Fastest way for substring replacement in pandas

I have a list of substrings that I want to replace with ' '. What is the fastest way to do this? Is it possible with Cython? My attempts below are really slow on 1 million rows, so I'm looking for the fastest execution.

Example:

import pandas as pd

df = pd.DataFrame({
    "text": [
        "first text to replace",
        "second text to replace",
        "test this string",
        "this is not the first string",
        "short string test",
    ]
})

removal_list = ["text to replace", "this string"]

Some attempts:

def replace_str(df, col, removal_list):
    # Replace each substring in turn; every pass rewrites the whole column.
    for item in removal_list:
        df[col] = df[col].str.replace(item, ' ', regex=False)
    return df

replace_str(df, 'text', removal_list)



import re

def replace_text(text):
    # One compiled pattern per substring, applied one after another.
    miscdict_comp = {re.compile(a): ' ' for a in removal_list}
    for pattern, replacement in miscdict_comp.items():
        text = pattern.sub(replacement, text)
    return text

df['text'] = df['text'].apply(replace_text)

Upvotes: 1

Views: 734

Answers (1)

mozway

Reputation: 261830

This looks like a simple use of str.replace with a single regex that joins all the substrings with | (alternation):

reg = '|'.join(removal_list)
df['text'].str.replace(reg, '', regex=True)

output:

0                          first 
1                         second 
2                           test 
3    this is not the first string
4               short string test
Name: text, dtype: object
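
One caveat worth sketching: joining the substrings as-is treats them as regex patterns, so a substring containing characters such as $, ( or . would not be matched literally. A minimal sketch of a more defensive variant follows, assuming the substrings are meant as literal text. The sample rows and removal_list values here are hypothetical and only serve to show the escaping; sorting longest-first is an extra precaution so that a short substring cannot shadow a longer one that starts the same way.

```python
import re

import pandas as pd

# Hypothetical data: the first substring contains regex metacharacters.
df = pd.DataFrame(
    {"text": ["price is $5 (approx.)", "first text to replace", "nothing to remove"]}
)
removal_list = ["$5 (approx.)", "text to replace"]

# Escape each substring so it is matched literally, and try longer
# substrings first so a shorter alternative cannot pre-empt a longer one.
pattern = "|".join(
    re.escape(s) for s in sorted(removal_list, key=len, reverse=True)
)

result = df["text"].str.replace(pattern, "", regex=True)
print(result.tolist())
```

If the substrings are known to be plain words, as in the question, the unescaped join above behaves the same way.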

This runs quite fast. Here is the timing on 1M rows (df = pd.concat([df]*200000), using OP's dataframe):

397 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In comparison:

# replace_str
586 ms ± 9.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# replace_text
1.5 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

NB. I removed the assignment parts for the test so that only the computation is compared. The assignment step also takes time, so approaches that assign multiple times will be slower in practice.
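
To repeat this kind of comparison on your own machine, something like the following could be used. It is only a rough sketch: the helper names single_regex and loop_replace are made up for illustration, the frame is kept at 10,000 rows so it runs quickly (raise the multiplier to approach the 1M-row test), and, as in the timings above, nothing is assigned back to the dataframe. Actual numbers will depend on hardware and pandas version.

```python
import re
import timeit

import pandas as pd

base = pd.DataFrame({"text": ["first text to replace",
                              "second text to replace",
                              "test this string",
                              "this is not the first string",
                              "short string test"]})
removal_list = ["text to replace", "this string"]

# 5 rows * 2000 = 10,000 rows; use 200000 to reproduce the 1M-row setup.
big = pd.concat([base] * 2000, ignore_index=True)


def single_regex(s):
    # One pass over the column with an alternation of all substrings.
    return s.str.replace("|".join(map(re.escape, removal_list)), "", regex=True)


def loop_replace(s):
    # One literal-replacement pass per substring.
    for item in removal_list:
        s = s.str.replace(item, "", regex=False)
    return s


# Make sure both approaches agree on this data before timing them.
same = single_regex(big["text"]).equals(loop_replace(big["text"]))

for fn in (single_regex, loop_replace):
    best = min(timeit.repeat(lambda: fn(big["text"]), number=1, repeat=3))
    print(f"{fn.__name__}: {best * 1000:.1f} ms (best of 3)")
```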

Upvotes: 5
