destructo

Reputation: 149

Series string replace with contents from another series (without using apply)

For the sake of optimization, I want to know if it's possible to do a faster string replace in one column, using the contents of the corresponding row from another column, without using apply.

Here is my dataframe:

import pandas as pd

data_dict = {'root': [r'c:/windows/'], 'file': [r'c:/windows/system32/calc.exe']}
df = pd.DataFrame.from_dict(data_dict)

"""
Result:
                           file         root
0  c:/windows/system32/calc.exe  c:/windows/
"""

Using the following apply, I can get what I'm after:

df['trunc'] = df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)

"""
Result:
                           file         root              trunc
0  c:/windows/system32/calc.exe  c:/windows/  system32/calc.exe 
"""

However, in the interest of efficiency, I'm wondering if there is a better way. I've tried the code below, but it doesn't work the way I expected.

df['trunc'] = df['file'].replace(df['root'], '')

"""
Result (note that the root was NOT replaced with a blank string in the 'trunc' column):

                           file         root                         trunc
0  c:/windows/system32/calc.exe  c:/windows/  c:/windows/system32/calc.exe
"""

Are there any more efficient alternatives? Thanks!

EDIT - Timings for the approaches suggested in the answers below

# Expand out the data set to 1000 entries
data_dict = {'root': [r'c:/windows/']*1000, 'file': [r'c:/windows/system32/calc.exe']*1000}
df0 = pd.DataFrame.from_dict(data_dict)

Using Apply

%%timeit -n 100
df0['trunk0'] = df0.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)

100 loops, best of 3: 13.9 ms per loop

Using Replace (thanks Gayatri)

%%timeit -n 100
df0['trunk1'] = df0['file'].replace(df0['root'], '', regex=True)

100 loops, best of 3: 365 ms per loop

Using Zip (thanks 0p3n5ourcE)

%%timeit -n 100
df0['trunk2'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df0.file, df0.root)]

100 loops, best of 3: 600 µs per loop

Overall, looks like zip is the best option here. Thanks for all the input!
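For anyone who wants to reproduce the comparison outside IPython, here is a minimal sketch using the standard timeit module instead of the %%timeit magic. The helper names with_apply and with_zip are made up for illustration, and the absolute numbers will vary by machine and pandas version, so treat it only as a way to check the relative ordering on your own setup.

```python
import timeit

import pandas as pd

# Same synthetic data as above: 1000 identical rows.
data_dict = {'root': [r'c:/windows/'] * 1000,
             'file': [r'c:/windows/system32/calc.exe'] * 1000}
df0 = pd.DataFrame(data_dict)


def with_apply(df):
    # Row-wise apply: builds a Series object for every row, which is slow.
    return df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)


def with_zip(df):
    # Plain Python loop over the two columns in lockstep.
    return [f.replace(r, '') for f, r in zip(df['file'], df['root'])]


# Both approaches should produce the same values.
apply_result = with_apply(df0).tolist()
zip_result = with_zip(df0)

# Best of 3 repeats, 10 calls each, reported per call.
apply_time = min(timeit.repeat(lambda: with_apply(df0), number=10, repeat=3)) / 10
zip_time = min(timeit.repeat(lambda: with_zip(df0), number=10, repeat=3)) / 10
print(f'apply: {apply_time * 1e3:.3f} ms per loop')
print(f'zip:   {zip_time * 1e3:.3f} ms per loop')
```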

Upvotes: 2

Views: 68

Answers (2)

niraj

Reputation: 18208

Using a similar approach to the one in the linked answer:

df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]

Output:

                           file         root              trunc
0  c:/windows/system32/calc.exe  c:/windows/  system32/calc.exe

Checking with timeit:

%%timeit
df['trunc'] = df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)

Result:

1000 loops, best of 3: 469 µs per loop

Using zip:

%%timeit
df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]

Result:

1000 loops, best of 3: 322 µs per loop
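One thing to keep in mind with the zip approach: str.replace removes every occurrence of the root, not just a leading one. If only the prefix should be stripped, a startswith check is one option. The sketch below uses a made-up two-row frame (the second path and the trunc_prefix column are purely illustrative, not from the question) to show the difference when the roots vary per row.

```python
import pandas as pd

# Hypothetical data: a different root per row, and a root that
# happens to appear twice in the second file path.
df = pd.DataFrame({
    'root': ['c:/windows/', 'c:/users/'],
    'file': ['c:/windows/system32/calc.exe', 'c:/users/me/c:/users/notes.txt'],
})

# Row-wise replace: removes ALL occurrences of the root in each path.
df['trunc'] = [f.replace(r, '') for f, r in zip(df['file'], df['root'])]

# Prefix-only variant: strips the root only when the path starts with it.
df['trunc_prefix'] = [
    f[len(r):] if f.startswith(r) else f
    for f, r in zip(df['file'], df['root'])
]

print(df[['trunc', 'trunc_prefix']])
```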

Upvotes: 1

Gayatri

Reputation: 2253

Try this:

df['file'] = df['file'].astype(str)
df['root'] = df['root'].astype(str)
df['file'].replace(df['root'],'', regex=True)

Output:

0    system32/calc.exe
Name: file, dtype: object
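As far as I understand it, regex=True is what makes the difference: without it, Series.replace only matches whole cell values, so a substring like the root is never found. With it, the pattern is treated as a regular expression, which also means characters such as '.' act as wildcards unless they are escaped. A minimal sketch with a single scalar pattern to keep things simple (the second path, 'c:/axb/file.txt', is made up for illustration):

```python
import re

import pandas as pd

s = pd.Series(['c:/windows/system32/calc.exe', 'c:/axb/file.txt'])

# Default behaviour: matches whole values only, so nothing changes here.
whole = s.replace('c:/windows/', '')

# regex=True: the pattern is matched inside each string.
sub = s.replace('c:/windows/', '', regex=True)

# Caveat: '.' is a regex wildcard, so 'c:/a.b/' also matches 'c:/axb/'.
unescaped = s.replace('c:/a.b/', '', regex=True)

# re.escape makes the pattern literal, so 'c:/axb/' is left alone.
escaped = s.replace(re.escape('c:/a.b/'), '', regex=True)

print(sub.tolist())
print(unescaped.tolist())
```

If the roots can contain regex metacharacters, escaping them first (or using the zip approach from the other answer, which works on plain strings) avoids surprises.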

Upvotes: 1
