deejay217

Reputation: 109

Replace entire string based on regex match

I have a large pandas DataFrame of email addresses and want to replace every .edu email with "Edu". I came up with a highly inefficient way of doing it, but there has to be a better one. This is how I do it:

import pandas as pd
import re
inp = [{'c1':10, 'c2':'gedua.com'},   {'c1':11,'c2':'wewewe.Edu'},   {'c1':12,'c2':'wewewe.edu.ney'}]
dfn = pd.DataFrame(inp)

for index, row in dfn.iterrows():
    # re.search returns None when there is no match
    if re.search(r'\.edu', row['c2']):
        # .loc avoids chained assignment (dfn.c2[index] = ...)
        dfn.loc[index, 'c2'] = 'Edu'
        print('Education')

Upvotes: 2

Views: 1068

Answers (2)

cs95

Reputation: 403278

Use str.contains for case-insensitive selection, and assign with loc.

dfn.loc[dfn.c2.str.contains(r'\.Edu', case=False), 'c2'] = 'Edu'    
dfn

   c1         c2
0  10  gedua.com
1  11        Edu
2  12        Edu

If you only want to replace the emails that end with .edu, anchor the pattern with $:

dfn.loc[dfn.c2.str.contains(r'\.Edu$', case=False), 'c2'] = 'Edu'

Or, as suggested by piR, with str.endswith (a plain, case-sensitive suffix test):

dfn.loc[dfn.c2.str.endswith('.Edu'), 'c2'] = 'Edu'

dfn

   c1              c2
0  10       gedua.com
1  11             Edu
2  12  wewewe.edu.ney  
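
For reference, here is a self-contained sketch of the two selections above, run on the question's sample data. The second variant lower-cases the column before calling str.endswith so that both variants ignore case; that normalization step and the names anywhere and suffix are additions for illustration, not part of the original suggestion.

```python
import pandas as pd

dfn = pd.DataFrame({
    'c1': [10, 11, 12],
    'c2': ['gedua.com', 'wewewe.Edu', 'wewewe.edu.ney'],
})

# Variant 1: '.edu' anywhere in the string, ignoring case
anywhere = dfn.copy()
anywhere.loc[anywhere.c2.str.contains(r'\.edu', case=False), 'c2'] = 'Edu'

# Variant 2: only strings that end with '.edu', ignoring case
suffix = dfn.copy()
suffix.loc[suffix.c2.str.lower().str.endswith('.edu'), 'c2'] = 'Edu'

print(anywhere.c2.tolist())  # ['gedua.com', 'Edu', 'Edu']
print(suffix.c2.tolist())    # ['gedua.com', 'Edu', 'wewewe.edu.ney']
```

Working on copies keeps the original frame intact, which makes it easy to compare the two results side by side.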

Upvotes: 3

piRSquared

Reputation: 294576

replace

dfn.replace(r'^.*\.Edu$', 'Edu', regex=True)

   c1              c2
0  10       gedua.com
1  11             Edu
2  12  wewewe.edu.ney

The pattern '^.*\.Edu$' matches everything from the beginning of the string up to and including a '.Edu' that sits at the very end of the string; replace then swaps that whole match for 'Edu'. Note that replace returns a new DataFrame, so assign the result back if you want to keep it.
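
To see what the pattern does outside of pandas, here is a small sketch with the standard re module on the question's three sample strings. The names pattern, samples, and result are just for illustration.

```python
import re

pattern = r'^.*\.Edu$'
samples = ['gedua.com', 'wewewe.Edu', 'wewewe.edu.ney']

# The whole string is replaced only when it ends with '.Edu';
# strings that do not match are returned unchanged.
result = [re.sub(pattern, 'Edu', s) for s in samples]
print(result)  # ['gedua.com', 'Edu', 'wewewe.edu.ney']
```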


Column specific

You may want to limit the scope to just a column (or columns). You can do that by passing a nested dictionary to replace: each outer key names a column, and the inner dictionary maps each pattern to its replacement.

dfn.replace({'c2': {r'^.*\.Edu$': 'Edu'}}, regex=True)

   c1              c2
0  10       gedua.com
1  11             Edu
2  12  wewewe.edu.ney

Case insensitive [thx @coldspeed]

pandas.DataFrame.replace does not have a case flag, but you can embed one in the pattern with '(?i)'.

dfn.replace({'c2': {r'(?i)^.*\.edu$': 'Edu'}}, regex=True)

   c1              c2
0  10       gedua.com
1  11             Edu
2  12  wewewe.edu.ney
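
As a cross-check, the sketch below compares the inline '(?i)' flag with Series.str.replace, which does accept a flags argument. The upper-case 'wewewe.EDU' sample and the names inline and flagged are made up for this illustration.

```python
import re
import pandas as pd

dfn = pd.DataFrame({
    'c1': [10, 11, 12],
    'c2': ['gedua.com', 'wewewe.EDU', 'wewewe.edu.ney'],
})

# Inline flag embedded in the pattern, as in the answer above
inline = dfn.replace({'c2': {r'(?i)^.*\.edu$': 'Edu'}}, regex=True)

# Same pattern without the inline flag, passing re.IGNORECASE instead
flagged = dfn.c2.str.replace(r'^.*\.edu$', 'Edu', regex=True, flags=re.IGNORECASE)

print(inline.c2.tolist())  # ['gedua.com', 'Edu', 'wewewe.edu.ney']
print(flagged.tolist())    # ['gedua.com', 'Edu', 'wewewe.edu.ney']
```

The str.replace form only touches one column at a time, while the dictionary form of replace can cover several columns in a single call.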

Upvotes: 2
