Stephen Juza

Reputation: 293

Using regex to remove unwanted end of a string

I'm struggling a little with some regex execution to remove trailing extraneous characters. I've tried a few ideas that I found here, but none are quite what I'm looking for.

Data looks like this (only one column of data):

City1[edit]

City2 (University Name)

City with a Space (University Name)

Etc.

The trouble is that I can't simply remove everything after the first space, because some city names contain spaces ("New York City").

However, what I think I could do is a three step approach:

  1. Replace anything between [], (), or {} pairs (this will remove the "edit" and the "University Name" in the sample data).
  2. Remove the [], (), and {} characters themselves, since they are now extra characters.
  3. Trim any trailing spaces (which will leave the spaces inside city names such as St. Paul).

I have two main questions:

  1. Is there a way to do this in one command, or will it have to be three separate commands?
  2. How do you remove characters in between specific characters using regex?

Code that I have attempted:

  1. DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace=True), but this only removed the last special character in each string.

  2. DF[0].replace(r'[\W+$|^0-9a-zA-Z*]', "", regex=True, inplace=True), but this replaced everything and left all my data blank.

Upvotes: 2

Views: 1857

Answers (3)

piRSquared

Reputation: 294258

Option with split:
look for zero or more whitespace characters followed by a [, (, or {,
split at that point, and take the first part.

df.names.str.split(r'\s*[\[\{\(]').str[0]

0                City1
1                City2
2    City with a Space
Name: names, dtype: object
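
The same split can be tried on plain strings with the standard `re` module, which is a quick way to check the pattern before running it on a column. This is only a sketch: `strip_bracket_suffix` is a hypothetical helper name, and the "St. Paul" row is added here to show that a string without brackets comes back unchanged.

```python
import re

# Same pattern as above: optional whitespace, then an opening bracket.
BRACKET_START = re.compile(r'\s*[\[\{\(]')

def strip_bracket_suffix(text):
    """Return the part of text before the first [, ( or {."""
    # maxsplit=1 stops at the first bracket; [0] keeps the left-hand part.
    return BRACKET_START.split(text, maxsplit=1)[0]

samples = ['City1[edit]',
           'City2 (University Name)',
           'City with a Space (University Name)',
           'St. Paul']
cleaned = [strip_bracket_suffix(s) for s in samples]
print(cleaned)
```

Because the whitespace before the bracket is part of the split pattern, no separate strip step is needed.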

Upvotes: 0

Jacobm001

Reputation: 4539

A regexp would be a relatively easy way to do this.

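import re

# Raw string so the backslashes reach the regex engine unchanged.
# Matches an opening bracket, the text inside, and a closing bracket.
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')
dirty = 'City with a Space (University Name)'
# Remove the bracketed part, then trim the leftover trailing space.
cleaned = p.sub('', dirty).strip()
print(cleaned)  # City with a Space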
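
One thing to watch with the pattern above is that `.+` is greedy, so a string with two bracketed groups loses everything between the first opening bracket and the last closing one. A possible variant, sketched below, uses a negated character class so each group is removed on its own. The input string and the names `greedy` and `per_group` are made up for illustration.

```python
import re

# Pattern from the answer above (greedy .+ spans from the first opening
# bracket to the last closing bracket).
greedy = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')

# Variant: optional whitespace, an opening bracket, any run of characters
# that are not closing brackets, then one closing bracket.
per_group = re.compile(r'\s*[\[({][^\])}]*[\])}]')

# Hypothetical input with two separate bracketed groups.
text = 'Springfield (A College) East [edit]'

greedy_result = greedy.sub('', text).strip()
per_group_result = per_group.sub('', text).strip()

print(greedy_result)     # Springfield
print(per_group_result)  # Springfield East
```

For the sample data in the question, where the bracketed text is always at the end, both patterns give the same result.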

Upvotes: 0

Ted Petrou

Reputation: 61967

If you always know the bracket characters that will come first you can do:

Create data

import pandas as pd

df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

Then replace everything after first bracket.

df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()

Output

0                City1
1                City2
2    City with a Space
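
On the "one command" part of the question: the three alternatives can be folded into a single character class, and the whitespace in front of the bracket can be absorbed by the pattern so that no separate strip is needed. The sketch below assumes the bracketed text always runs to the end of the string. The `city` column name and the "St. Paul" row are added for illustration only.

```python
import pandas as pd

# Sample data from the question, plus one row with no brackets at all.
df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}',
                             'St. Paul']})

# Optional whitespace, the first opening bracket, then the rest of the string.
pattern = r'\s*[\[({].*$'

# regex=True is passed explicitly so the pattern is not treated as a literal.
df['city'] = df['names'].str.replace(pattern, '', regex=True)
print(df['city'].tolist())
```

Assigning the result to a column, rather than using `inplace=True` on a selected column, keeps it clear which object is being modified.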

Upvotes: 3
