Reputation: 149
For the sake of optimization, I want to know if it's possible to do a faster string replace in one column, using the contents of the corresponding row from another column, without using apply.
Here is my dataframe:
data_dict = {'root': [r'c:/windows/'], 'file': [r'c:/windows/system32/calc.exe']}
df = pd.DataFrame.from_dict(data_dict)
"""
Result:
                           file         root
0  c:/windows/system32/calc.exe  c:/windows/
"""
Using the following apply, I can get what I'm after:
df['trunc'] = df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)
"""
Result:
                           file         root              trunc
0  c:/windows/system32/calc.exe  c:/windows/  system32/calc.exe
"""
However, apply works row by row, so I'm wondering if there is a more efficient way. I've tried the code below, but it doesn't work the way I expected it to.
df['trunc'] = df['file'].replace(df['root'], '')
"""
Result (note that the root was NOT replaced with a blank string in the 'trunc' column):
                           file         root                         trunc
0  c:/windows/system32/calc.exe  c:/windows/  c:/windows/system32/calc.exe
"""
Are there any more efficient alternatives? Thanks!
EDIT - With timings from the couple of examples below
# Expand out the data set to 1000 entries
data_dict = {'root': [r'c:/windows/']*1000, 'file': [r'c:/windows/system32/calc.exe']*1000}
df0 = pd.DataFrame.from_dict(data_dict)
Using Apply
%%timeit -n 100
df0['trunk0'] = df0.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)
100 loops, best of 3: 13.9 ms per loop
Using Replace (thanks Gayatri)
%%timeit -n 100
df0['trunk1'] = df0['file'].replace(df0['root'], '', regex=True)
100 loops, best of 3: 365 ms per loop
Using Zip (thanks 0p3n5ourcE)
%%timeit -n 100
df0['trunk2'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df0.file, df0.root)]
100 loops, best of 3: 600 µs per loop
Overall, looks like zip is the best option here. Thanks for all the input!
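One caveat with the zip version, sketched below under some assumptions: str.replace removes every occurrence of the root, not just the leading one. If the root should only be stripped from the start of the path, a small helper can do that instead. The name strip_root and the second sample row are made up for illustration and are not from the posts above.

```python
import pandas as pd


def strip_root(path, root):
    # Hypothetical helper: drop `root` only when it is a leading prefix.
    if path.startswith(root):
        return path[len(root):]
    return path


# Second row is invented to show a root that also appears mid-path.
df = pd.DataFrame({
    'root': ['c:/windows/', 'lib/'],
    'file': ['c:/windows/system32/calc.exe', 'lib/vendor/lib/util.py'],
})

# str.replace removes every occurrence of root, including the inner 'lib/'.
df['replaced'] = [f.replace(r, '') for f, r in zip(df['file'], df['root'])]

# strip_root removes only the leading occurrence.
df['trunc'] = [strip_root(f, r) for f, r in zip(df['file'], df['root'])]

print(df[['replaced', 'trunc']])
```

For the single-row data in the question both columns come out the same, so this only matters when a root can repeat inside the path.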
Upvotes: 2
Views: 68
Reputation: 18208
Using a similar approach to the one in the link:
df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]
Output:
                           file         root              trunc
0  c:/windows/system32/calc.exe  c:/windows/  system32/calc.exe
Checking with timeit:
%%timeit
df['trunc'] = df.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)
Result:
1000 loops, best of 3: 469 µs per loop
Using zip:
%%timeit
df['trunc'] = [file_val.replace(root_val, '') for file_val, root_val in zip(df.file, df.root)]
Result:
1000 loops, best of 3: 322 µs per loop
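The timings above are on a single-row frame, where fixed overhead dominates. To repeat the comparison outside IPython on a larger frame, something like the following sketch could be used. It relies on the standard timeit module; the row count, the repeat count and the function names with_apply and with_zip are arbitrary choices for illustration, and the numbers it prints will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Assumed benchmark frame: the question's single row repeated 1000 times.
n_rows = 1000
df = pd.DataFrame({
    'root': ['c:/windows/'] * n_rows,
    'file': ['c:/windows/system32/calc.exe'] * n_rows,
})


def with_apply(frame):
    # Row-wise apply: pandas hands each row to the lambda as a Series.
    return frame.apply(lambda x: x['file'].replace(x['root'], ''), axis=1)


def with_zip(frame):
    # Plain Python loop over the two columns in lockstep.
    return [f.replace(r, '') for f, r in zip(frame['file'], frame['root'])]


# Both approaches should agree before their timings are compared.
assert with_apply(df).tolist() == with_zip(df)

repeats = 20
for func in (with_apply, with_zip):
    seconds = timeit.timeit(lambda: func(df), number=repeats) / repeats
    print(f'{func.__name__}: {seconds * 1e3:.3f} ms per call')
```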
Upvotes: 1
Reputation: 2253
Try this:
df['file'] = df['file'].astype(str)
df['root'] = df['root'].astype(str)
df['file'].replace(df['root'],'', regex=True)
Output:
0    system32/calc.exe
Name: file, dtype: object
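The regex=True flag is what makes the difference here. The sketch below uses a single scalar pattern as a simplified stand-in for the 'root' column, which is an assumption made for illustration. It shows that Series.replace matches whole cell values by default and only replaces substrings in regex mode. It also shows that regex mode treats the pattern as a regular expression, so roots containing characters such as '.' may need re.escape. The 'a.b/' sample is invented.

```python
import re

import pandas as pd

files = pd.Series(['c:/windows/system32/calc.exe'])

# Default mode: to_replace must equal the entire cell value, so nothing changes.
unchanged = files.replace('c:/windows/', '')

# regex=True: the pattern is searched for inside each string.
trimmed = files.replace('c:/windows/', '', regex=True)

# Regex mode also interprets metacharacters: '.' matches any character.
names = pd.Series(['axb/file.txt'])
loose = names.replace('a.b/', '', regex=True)

# Escaping the pattern makes it match only the literal text 'a.b/'.
strict = names.replace(re.escape('a.b/'), '', regex=True)

print(unchanged[0], trimmed[0], loose[0], strict[0])
```

For Windows-style paths with backslashes, escaping matters even more, since a backslash starts an escape sequence in a regular expression.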
Upvotes: 1