nilsinelabore
nilsinelabore

Reputation: 5095

Filter and replace substring in Pandas

How can I filter df rows where name contains Al, and replace large with L in sport?

Reproducible example:

df = pd.DataFrame({'name': ['Bob', 'Jane', 'Alice'], 
                   'sport': ['tennis small', 'football medium', 'basketball large']})

Expected output:

    name    sport
0   Bob     tennis small
1   Jane    football medium
2   Alice   basketball L

Upvotes: 2

Views: 878

Answers (3)

jezrael
jezrael

Reputation: 862611

Filter only necessary rows and for them use Series.str.replace, it is better for performance like replace all column without filtering:

m = df['name'].str.contains('Al', na=False)
df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
print (df)
    name            sport
0    Bob     tennis small
1   Jane  football medium
2  Alice     basketball L

#100 matched values from 30k
df = pd.DataFrame({'name': ['Bob','Jane','alice'] * 9900 + ['Bob', 'Jane', 'Alice'] * 100, 
                   'sport': ['tennis small','football medium', 'basketball large'] * 10000})

print (df)
        name             sport
0        Bob      tennis small
1       Jane   football medium
2      alice  basketball large
3        Bob      tennis small
4       Jane   football medium
     ...               ...
29995   Jane   football medium
29996  Alice  basketball large
29997    Bob      tennis small
29998   Jane   football medium
29999  Alice  basketball large

[30000 rows x 2 columns]
In [76]: %%timeit
    ...: m = df['name'].str.contains('Al')
    ...: df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
    ...: 
    ...: 
14.6 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [77]: %%timeit
    ...: df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')
    ...: 
    ...: 
34.8 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [78]: %%timeit
    ...: df['sport'] = np.where(df['name'].str.contains('Al'), df['sport'].str.replace('large', 'L', regex=True),  df['sport'])
    ...: 
    ...: 
35 ms ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

#10k matched values from 30k
df = pd.DataFrame({'name': ['Bob', 'Jane','Alice'] * 10000, 
                   'sport': ['tennis small', 'football medium', 'basketball large'] * 10000})



print (df)


In [80]: %%timeit
    ...: m = df['name'].str.contains('Al')
    ...: df.loc[m, 'sport'] = df.loc[m, 'sport'].str.replace('large', 'L', regex=True)
    ...: 
    ...: 
22.2 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [81]: %%timeit
    ...: df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')
    ...: 
    ...: 
34 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [82]: %%timeit
    ...: df['sport'] = np.where(df['name'].str.contains('Al'), df['sport'].str.replace('large', 'L', regex=True),  df['sport'])
    ...: 
    ...: 
34.9 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 1

sophocles
sophocles

Reputation: 13821

Similarly you can achieve that with numpy.where in conjunction with str.contains:

import numpy as np

df['sport'] = np.where(df['name'].str.contains('Al'),                     # Condition 
                       df['sport'].str.replace('large', 'L', regex=True), # return if True
                       df['sport'])                                       # return if False
>>> df
    name            sport
0    Bob     tennis small
1   Jane  football medium
2  Alice     basketball L

Upvotes: 1

U13-Forward
U13-Forward

Reputation: 71570

Try with loc and str.contains with str.replace:

df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')

Example:

>>> df.loc[df.name.str.contains('Al'), 'sport'] = df.sport.str.replace('large', 'L')
>>> df
    name            sport
0    Bob     tennis small
1   Jane  football medium
2  Alice     basketball L
>>> 

Filter the rows where the name column contains Al and then replace large with L in the sports column.

Upvotes: 1

Related Questions