Daniel Cruz

Reputation: 40

Regex to remove specific parts of a string in a column dataframe python

I'm working with a dataframe that contains addresses, and I want to delete a specific part of each string. For example: dataset of addresses

I want to delete everything from the words "REFERENCE:" or "reference:" to the end of the string. I also want to create a new column holding the result (without the word REFERENCE:/reference: and the text that follows it). Could you help me do this with a regex? I want the new column to look something like this: edit_column

Upvotes: 0

Views: 369

Answers (2)

The regex should look like this:

import re

# Match "reference:" in any letter case, plus the whitespace before it, through the end of the string.
discard_re = re.compile(r'\s*reference:.*', re.IGNORECASE)

Then you can add the new column:

df['address_new'] = df.addresses.map(lambda x: discard_re.sub('', x))
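The same idea can probably be written without `map` by using pandas' vectorized `Series.str.replace`. The following is only a sketch: the sample rows, the third address without a marker, and the decision to also strip the whitespace before "reference:" are assumptions, not part of the question's data.

```python
import re

import pandas as pd

# Hypothetical sample data; the third row has no "reference:" marker at all.
df = pd.DataFrame({
    "addresses": [
        "Street Pases de la Reforma #200 REFERENCE: Green house",
        "Street Carranza #300 12 & 13 reference: There is a tree",
        "Avenue Juarez #15",
    ]
})

# Drop "reference:" (any letter case), the whitespace before it,
# and everything after it. Rows without the marker are left unchanged.
df["address_new"] = df["addresses"].str.replace(
    r"\s*reference:.*", "", flags=re.IGNORECASE, regex=True
)

print(df["address_new"].tolist())
```

Unlike the `map` with a lambda, the `.str` accessor passes missing values through as NaN instead of raising an error.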

Upvotes: 1

gold_cy

Reputation: 14216

You can use some regex to obtain the desired results.

import pandas as pd

df = pd.DataFrame({"address": ["Street Pases de la Reforma #200 REFERENCE: Green house", "Street Carranza #300 12 & 13 REFERENCE: There is a tree"]})

df.address.str.findall(r".+?(?=REFERENCE)").explode()

0    Street Pases de la Reforma #200 
1       Street Carranza #300 12 & 13

Explanation of the regex pattern:

. matches any character (except for line terminators)
+? is a lazy quantifier: it matches between one and unlimited times, as few times as possible, expanding as needed
(?=REFERENCE) is a positive lookahead: it asserts that the literal text REFERENCE comes next, without including it in the match
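One thing to watch for: the lookahead pattern above is case-sensitive, so a lowercase "reference:" produces no match, and rows with no marker at all come back empty. A possible variant, sketched here with `Series.str.extract` and an optional trailing group, keeps the whole address in those cases. The sample rows and the `address_new` column name are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data covering upper case, lower case, and no marker.
df = pd.DataFrame({
    "address": [
        "Street Pases de la Reforma #200 REFERENCE: Green house",
        "Street Carranza #300 12 & 13 reference: There is a tree",
        "Avenue Juarez #15",
    ]
})

# Group 1 lazily captures the address; the optional non-capturing group
# swallows "reference:" and everything after it when the marker is present.
pattern = r"^(.*?)\s*(?:reference:.*)?$"

df["address_new"] = df["address"].str.extract(
    pattern, flags=re.IGNORECASE, expand=False
)

print(df["address_new"].tolist())
```

With `expand=False` and a single capture group, `extract` returns a Series, so the result can be assigned straight to a new column.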

Upvotes: 1
