String wildcard in pandas on replace function

Question

I'm sure this problem has an easy answer but I'm having trouble figuring out the correct string to use. I basically want to replace any email address in a data frame with the new domain. For a specific column, replace the substring '@*' where * is any set of characters with '@newcompany.com'. I want to keep whatever comes prior to the @ as is. Thanks all.

df_users['EMAIL'] = df_users['EMAIL'].str.replace('@', '@newcompany.com')

EdChum · Accepted Answer

You can use the vectorised str methods to split on the '@' character and then join the left side with the new domain name:

In [42]:

df = pd.DataFrame({'email':['asdsad@old.com', 'asdsa@google.com', 'hheherhe@apple.com']})
df
Out[42]:
                email
0      asdsad@old.com
1    asdsa@google.com
2  hheherhe@apple.com

In [43]:

df['email'] = df.email.str.split('@').str[0] + '@newcompany.com'
df

Out[43]:
                     email
0    asdsad@newcompany.com
1     asdsa@newcompany.com
2  hheherhe@newcompany.com
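
For anyone who wants to run the split approach outside IPython, here is a minimal self-contained sketch. The sample frame is an assumption, and its last two rows (a value with no '@' and a missing value) are edge cases added for illustration, not part of the question. Note that a value without '@' still gets the new domain appended, because split returns the whole string as its first element.

```python
import pandas as pd

# Hypothetical sample data; the last two rows are illustrative edge cases.
df = pd.DataFrame({'email': ['asdsad@old.com', 'asdsa@google.com', 'no-at-sign', None]})

# Keep everything before the first '@', then append the new domain.
local_part = df['email'].str.split('@').str[0]
split_result = local_part + '@newcompany.com'

print(split_result.tolist())
```

Missing values stay missing, so no separate null handling should be needed for this step.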

Another method is to call the vectorised str.replace, which accepts a regex as the pattern. On pandas 2.0 and later you need to pass regex=True, because the default changed to a literal match:

In [56]:

df['email'] = df['email'].str.replace(r'@.+', '@newcompany.com', regex=True)
df
Out[56]:
                     email
0    asdsad@newcompany.com
1     asdsa@newcompany.com
2  hheherhe@newcompany.com
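
The regex version behaves differently from the split version when a value has no '@' at all: the pattern simply does not match, so the value is left untouched. A small sketch of that, assuming a hypothetical frame whose last row is an illustrative edge case:

```python
import pandas as pd

# Hypothetical sample data; 'no-at-sign' is an illustrative edge case.
df = pd.DataFrame({'email': ['asdsad@old.com', 'asdsa@google.com', 'no-at-sign']})

# regex=True is passed explicitly so that '@.+' is treated as a pattern
# rather than as a literal substring.
regex_result = df['email'].str.replace(r'@.+', '@newcompany.com', regex=True)

print(regex_result.tolist())
```

Which behaviour is preferable depends on whether rows without a domain should be given one or left alone.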

Timings

In [58]:

%timeit df['email'] = df['email'].str.replace(r'@.+', '@newcompany.com', regex=True)
1000 loops, best of 3: 632 µs per loop

In [60]:

%timeit df['email'] = df.email.str.split('@').str[0] + '@newcompany.com'
1000 loops, best of 3: 1.66 ms per loop

In [63]:

%timeit df['email'] = df['email'].replace(r'@.+', '@newcompany.com', regex=True)
1000 loops, best of 3: 738 µs per loop

Here we can see that the str.replace regex version is roughly 2.6x faster than the split method (632 µs vs 1.66 ms). Interestingly, the Series.replace method with regex=True, which would seem to be doing the same thing as str.replace, is slightly slower at 738 µs.
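
The timings above were taken on a three-row frame, where fixed per-call overhead probably dominates. A rough sketch for re-measuring on a larger Series follows. The row count, the number of repetitions, and the `candidates` dictionary are arbitrary choices for illustration, and results will vary with the pandas version and the machine.

```python
import timeit
import pandas as pd

# Hypothetical larger sample: the three addresses from above, repeated.
emails = pd.Series(['asdsad@old.com', 'asdsa@google.com', 'hheherhe@apple.com'] * 10_000)

# The three approaches compared in the timings.
candidates = {
    'str.replace': lambda s: s.str.replace(r'@.+', '@newcompany.com', regex=True),
    'str.split': lambda s: s.str.split('@').str[0] + '@newcompany.com',
    'Series.replace': lambda s: s.replace(r'@.+', '@newcompany.com', regex=True),
}

# Compute each result once so the outputs can be compared.
results = {name: func(emails) for name, func in candidates.items()}

# Time each approach over a few repetitions and report the mean per call.
repeats = 5
for name, func in candidates.items():
    seconds = timeit.timeit(lambda: func(emails), number=repeats)
    print(f'{name}: {seconds / repeats * 1000:.2f} ms per call')
```

All three approaches should produce the same output on this data, so the comparison is only about speed.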
