Pruthvi Reddy

Reputation: 43

Find a regex and modify only a part of it in a Pandas DataFrame

Imagine I have a few values like

test_val1 = 'E 18TH ST AND A AVE'
test_val2 = 'E 31ST ST AND A AVE'

I want to find the ordinals (18TH, 31ST, etc.) and replace them with just the number (18, 31), i.e. remove the suffix but keep the rest of the string as it is.

Expected value

test_val1 = 'E 18 ST AND A AVE'
test_val2 = 'E 31 ST AND A AVE'

Please note that I do not want to remove the "St" which corresponds to 'street', so a blind replacement is not possible.

My approach was to use the code below (for 'TH' only at the moment), but it doesn't work since the function cannot keep the value/text in memory to return it.

import regex as re
test_val1.replace('\d{1,}TH', '\d{1,}', regex=True)

I have a column full of these values, so a solution that I can run/apply on a Pandas column would be really helpful.

Upvotes: 3

Views: 824

Answers (4)

Timus

Reputation: 11351

For the following sample dataframe

import pandas as pd
df = pd.DataFrame({"Test": ['E 18TH ST AND A AVE', 'E 31ST ST AND A AVE']})
                  Test
0  E 18TH ST AND A AVE
1  E 31ST ST AND A AVE

this

df.Test = df.Test.str.replace(r'(\d+)(TH|ST)', lambda m: m.group(1), regex=True)

produces

                Test
0  E 18 ST AND A AVE
1  E 31 ST AND A AVE

Is that what you are looking for? Check out the docs for more details.

The lambda function is used as the repl ("replacement") function: its return value replaces each pattern match in the strings. By definition it receives the respective match object as its argument and has to return a string, usually derived from the match object, although it could be totally unrelated. The function here returns the content of the first capture group, the (\d+) part, via the match object's group method.
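To make the repl mechanics concrete, here is a minimal sketch using the standard-library re module rather than pandas; the helper name keep_digits is made up for illustration. It shows that a named function, a lambda, and the backreference string r"\1" all yield the same result for this pattern, and that the repl may also return something unrelated to the match.

```python
import re

# Same pattern as above: digits in group 1, suffix in group 2.
pattern = re.compile(r"(\d+)(TH|ST)")

def keep_digits(m):
    # m is the match object, e.g. for "18TH": group(1) == "18", group(2) == "TH".
    return m.group(1)

s = "E 18TH ST AND A AVE"

via_function = pattern.sub(keep_digits, s)
via_lambda = pattern.sub(lambda m: m.group(1), s)
via_backref = pattern.sub(r"\1", s)       # backreference to group 1
unrelated = pattern.sub(lambda m: "#", s)  # repl ignores the match entirely

print(via_function)
print(unrelated)
```

The standalone " ST" is left alone in every variant because the pattern only matches a suffix that directly follows digits.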

Upvotes: 2

Pruthvi Reddy

Reputation: 43

Take a sample value

test_val = 'E 32ND ST AND A AVE'

I wrapped it in a quick function

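import re  # the standard-library module is enough here

def street_suffix_remover(in_val):
    # Find every number that carries an ordinal suffix (18TH, 3RD, 31ST, 32ND)
    for x in re.findall(r"(\d{1,}TH|\d{1,}RD|\d{1,}ST|\d{1,}ND)", in_val):
        # Replace that token with the same token minus its suffix
        in_val = in_val.replace(x, re.sub(r"TH|ST|RD|ND", "", x))
    return in_val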

Test it on a sample

street_suffix_remover(test_val)

Output

E 32 ST AND A AVE

Ran it on a dataframe as below

test_df['address'] = test_df.apply(lambda row:street_suffix_remover(row['address']), axis=1)
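As a possible simplification, the function can be applied to the column itself with Series.apply, which avoids the row-wise lambda. The sketch below assumes a made-up two-row frame and uses a single-pass variant of the function (one re.sub with a backreference and a word boundary) rather than the findall loop above.

```python
import re
import pandas as pd

def street_suffix_remover(in_val):
    # Single-pass variant: drop TH/ST/ND/RD only when it directly follows digits.
    return re.sub(r"(\d+)(?:TH|ST|ND|RD)\b", r"\1", in_val)

# Hypothetical sample data; only the column name 'address' comes from the post.
test_df = pd.DataFrame({"address": ["E 32ND ST AND A AVE", "W 3RD ST AND B AVE"]})

# Series.apply passes each cell value to the function directly.
test_df["address"] = test_df["address"].apply(street_suffix_remover)
print(test_df["address"].tolist())
```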

Upvotes: 0

Francisco Moretti

Reputation: 151

I think I can help with the regex replacement. It seems like the function that you want to use is actually re.sub instead of replace. This is the function signature:

re.sub(pattern, repl, string, count=0, flags=0)

Check the official documentation.

Also here is an outstanding answer to a similar question.
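For the strings in the question, a call to re.sub could look like the sketch below. The extra ND and RD alternatives and the \b word boundary are assumptions that go beyond the two sample values.

```python
import re

test_val1 = 'E 18TH ST AND A AVE'
test_val2 = 'E 31ST ST AND A AVE'

# pattern: digits captured in group 1, followed by an ordinal suffix.
pattern = r'(\d+)(?:TH|ST|ND|RD)\b'

# repl: r'\1' writes back only the captured digits, dropping the suffix.
result1 = re.sub(pattern, r'\1', test_val1)
result2 = re.sub(pattern, r'\1', test_val2)

print(result1)
print(result2)
```

Because the suffix must directly follow digits, the standalone "ST" for street is not touched.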

Upvotes: 1

Srikanth

Reputation: 11

You mentioned it doesn't work since the function cannot keep the value/text in memory to return it. Is it mandatory NOT to store the value to a different variable?

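import re

column = ['E 18TH ST AND A AVE', 'E 31ST ST AND A AVE']    # stand-in for the dataframe column

cleaned = []
for t1 in column:    # t1 is an address in the dataframe column
    t2 = t1.split()
    t2[1] = re.sub(r'(TH|ST)', '', t2[1])    # assumes the ordinal is always the second token
    cleaned.append(' '.join(t2))    # store the result in a different variable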

Upvotes: 1
