Reputation: 293
I'm struggling a little with some regex execution to remove trailing extraneous characters. I've tried a few ideas that I found here, but none are quite what I'm looking for.
Data looks like this (only one column of data):
City1[edit]
City2 (University Name)
City with a Space (University Name)
Etc.
Basically, the trouble that I run into here is I can't necessarily remove everything after a space because sometimes a city name includes a space ("New York City").
However, what I think I could do is a three step approach:
I have two main questions: 1. Is there a way to do this in one command, or will it have to be three separate commands? 2. How do you remove characters in between specific characters using regex?
Code that I have attempted:
DF[0].replace(r'[^0-9a-zA-Z*]$', "", regex=True, inplace = True)
However, this only removed the single special character at the very end of each string.
DF[0].replace(r'[\W+$|^0-9a-zA-Z*]',"",regex=True, inplace=True)
Unfortunately, this replaced everything, leaving all my data blank.
Upvotes: 2
Views: 1857
Reputation: 294258
Option with split: look for zero or more whitespace characters followed by a [, (, or {, split at that point, and take the first part.
df.names.str.split(r'\s*[\[\{\(]').str[0]
0 City1
1 City2
2 City with a Space
Name: names, dtype: object
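For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same split idea. The sample rows and the column name names are assumptions based on the question, and n=1 is an addition so the split stops at the first bracket.

```python
import pandas as pd

# Sample data modelled on the question; the column name `names` is an assumption.
df = pd.DataFrame({'names': ['City1[edit]',
                             'City2 (University Name)',
                             'City with a Space {University Name}']})

# Split once (n=1) at optional whitespace followed by an opening bracket,
# then keep only the part to the left of the split point.
cleaned = df['names'].str.split(r'\s*[\[\{\(]', n=1).str[0]
print(cleaned.tolist())  # ['City1', 'City2', 'City with a Space']
```

Because the \s* is part of the split pattern, the space before the bracket is consumed too, so no separate strip is needed for these rows.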
Upvotes: 0
Reputation: 4539
A regexp would be a relatively easy way to do this.
import re
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')  # raw string avoids invalid-escape warnings
dirty = 'City with a Space (University Name)'
cleaned = p.sub('', dirty).strip()
print(cleaned)
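To apply the same pattern to a whole column rather than one string, something like the following should work. This is only a sketch: the Series contents are made up from the question's examples, and Series.map is just one way to do it.

```python
import re
import pandas as pd

# Same pattern as above, written as a raw string.
p = re.compile(r'(\(|\[|\{)[A-Za-z\ ].+(\)|\]|\})')

# Hypothetical sample column based on the question's data.
s = pd.Series(['City1[edit]',
               'City2 (University Name)',
               'City with a Space (University Name)',
               'New York City'])

# Remove the bracketed part from each value, then trim leftover whitespace.
cleaned = s.map(lambda text: p.sub('', text).strip())
print(cleaned.tolist())
# ['City1', 'City2', 'City with a Space', 'New York City']
```

Note that this pattern only matches when a closing bracket is present and there are at least two characters between the brackets, so a value with an unclosed bracket would be left unchanged.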
Upvotes: 0
Reputation: 61967
If you always know the bracket characters that will come first you can do:
Create data
import pandas as pd
df=pd.DataFrame({'names':['City1[edit]',
'City2 (University Name)',
'City with a Space {University Name}']})
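Then replace everything from the first bracket onward. Pass regex=True, since recent pandas versions treat the pattern as a literal string by default.
df.names.str.replace(r'\[.*|\(.*|\{.*', '', regex=True).str.strip()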
Output
0 City1
1 City2
2 City with a Space
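On the second question (removing only the characters between specific characters), a possible variant is to match an opening bracket, everything up to the matching kind of closing bracket, and the closing bracket itself. This is a sketch under assumptions: the sample values, including the one with text after the bracket, are invented for illustration, and it does not handle nested brackets.

```python
import pandas as pd

# Hypothetical sample values; the last one has text after the bracketed part.
s = pd.Series(['City1[edit]',
               'City2 (University Name)',
               'Springfield (State U) North'])

# Optional whitespace, an opening bracket, any run of non-closing-bracket
# characters, then a closing bracket. Only the bracketed segment is removed.
between = r'\s*[\[({][^\])}]*[\])}]'
cleaned = s.str.replace(between, '', regex=True).str.strip()
print(cleaned.tolist())  # ['City1', 'City2', 'Springfield North']
```

Unlike the "everything after the first bracket" pattern, this keeps any text that follows the closing bracket, and it is still a single command.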
Upvotes: 3