myeewyee
myeewyee

Reputation: 767

Pandas replace full word string

I have a dataframe:

df = pd.DataFrame({'id' : ['abarth 1.4 a','abarth 1 a','land rover 1.3 r','land rover 2',
                           'land rover 5 g','mazda 4.55 bl'], 
                   'series': ['a','a','r','','g', 'bl'] })

I would like to remove the 'series' string from the corresponding id, so the end result should be:

'id': ['abarth 1.4','abarth 1','land rover 1.3','land rover 2','land rover 5', 'mazda 4.55']

Currently I am using df.apply:

df.id = df.apply(lambda x: x['id'].replace(x['series'], ''), axis =1)

But this removes all instances of the strings, even in other words, like so: 'id': ['brth 1.4','brth 1','land ove 1.3','land rover 2','land rover 5', 'mazda 4.55']

Should I somehow mix and match regex with the variable inside df.apply, like so?

df.id = df.apply(lambda x: x['id'].replace(r'\b' + x['series'], ''), axis =1)

Upvotes: 1

Views: 567

Answers (2)

unutbu
unutbu

Reputation: 879391

You could use str.rpartition to split the ids on the last space.

In [169]: parts = df['id'].str.rpartition(' ')[[0,2]]; parts
Out[169]: 
                0   2
0      abarth 1.4   a
1        abarth 1   a
2  land rover 1.3   r
3      land rover   2
4    land rover 5   g
5      mazda 4.55  bl

Then you could use == to compare parts[2] to df['series']:

In [170]: mask = (parts[2] == df['series']); mask
Out[170]: 
0     True
1     True
2     True
3    False
4     True
5     True
dtype: bool

And finally, use df['id'].where to replace df['id] with parts[0] where mask is True:

import pandas as pd
df = pd.DataFrame(
    {'id' : ['abarth 1.4 a','abarth 1 a','land rover 1.3 r','land rover 2',
             'land rover 5 g','mazda 4.55 bl'], 
     'series': ['a','a','r','','g', 'bl'] })
parts = df['id'].str.rpartition(' ')[[0,2]]
mask = (parts[2] == df['series'])
df['id'] = df['id'].where(~mask, parts[0], axis=0)
print(df)

yields

               id series
0      abarth 1.4      a
1        abarth 1      a
2  land rover 1.3      r
3    land rover 2       
4    land rover 5      g
5      mazda 4.55     bl

Alternatively, you could use

import re
def remove_series(x):
    pat = r'{}$'.format(x['series'])
    return re.sub(pat, '', x['id'])
df['id'] = df.apply(remove_series, axis=1)

But calling df.apply with a custom function tends to be much slower than using built-in vectorize methods such as those used in the first method.

Upvotes: 1

Gustavo Bezerra
Gustavo Bezerra

Reputation: 11034

Using re, in case you want to specify the series string:

df.apply(lambda x: re.sub('\s*{}$'.format(x['series']), '', x['id']), axis=1)

In case the the series string is always a predictable pattern (i.e. [a-z]) you can also try:

df['id'].apply(lambda x: re.sub('\s*[a-z]+$', '', x))

Either way the output is what you are looking for:

0        abarth 1.4
1          abarth 1
2    land rover 1.3
3      land rover 2
4      land rover 5
5        mazda 4.55

Upvotes: 2

Related Questions