Chris Dixon
Chris Dixon

Reputation: 1046

Most efficient way to str.replace regex=True or similar in pandas?

Dozens of strings need to be replaced across multiple columns of thousands of dfs:

for df in dfs:
    for col in columns:
        for key, value in replacement_strs.items():
            df[col] = df[col].str.replace(key, value, regex=True)

Iterations of above take milliseconds but added up this takes hours, so we need something more efficient.

Could we apply re_sub or similar in a more efficient way? Use CstringIO like this answer suggests somehow? Some kind of vectorization?

Applying str.replace might be more efficient after combining dfs but pd.concat() blows out of available memory.

Edit: Crude reproducible example below. Note how elapsed time increases linearly with number of dfs when total number of cells is held constant by reducing rows per df (play with shape[0] and range(0,1000)):

import pandas as pd, numpy as np, string, random
from timeit import default_timer as timer
np.random.seed(123)

dfs = []
shape = [500, 10]

df = pd.DataFrame(np.arange(shape[0] * shape[1]).reshape(shape[0],shape[1])).applymap(lambda x: np.random.choice(list(string.ascii_letters.upper())))

for n in range(0,1000):
    dfs.append(df)

start = timer()

for df in dfs:
    for col in [col for col in range(0,shape[1])]:
        for key, value in {'A$': 'W','B': 'X','C[a-z]': 'Y','D': 'Z',}.items():
            df[col] = df[col].str.replace(key, value, regex=True)

end = timer()
print(end - start)

Upvotes: 1

Views: 585

Answers (1)

sammywemmy
sammywemmy

Reputation: 28709

If I understand you correctly, you could use pandas replace, along with your dictionary, and a comprehension to get your output, with improved speed:

mapping = {'A$': 'W','B': 'X','C[a-z]': 'Y','D': 'Z',}
[entry.replace(mapping, regex = False) for entry in dfs]

On my PC, your function runs in 17s, while the list comprehension above runs in 1.2s. There may be improvements (generators, multiprocessing). Using replace is a good start, before further optimisation (if really needed).

Upvotes: 1

Related Questions