chicagobeast12

Reputation: 695

Regex - removing everything after first word following a comma

I have a column of name variations that I'd like to clean up. I'm having trouble writing a regex that removes everything after the first word following a comma.

import re
import pandas as pd

d = {'names': ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']}
x = pd.DataFrame(d)

Tried:
x['names'] =  [re.sub(r'/.\s+[^\s,]+/','', str(x)) for x in x['names']]

Desired Output:
['smith,john','smith, john', 'brown, bob', 'brown, bob']

Not sure why my regex isn't working, but any help would be appreciated.

Upvotes: 2

Views: 78

Answers (2)

user1717828

Reputation: 7223

You could try a regex that captures everything up to and including the comma, any whitespace after it, and the first word that follows, then replaces the whole string with just that captured part:

x["names"].str.replace(r"^([^,]*,\s*[^\s]*).*", r"\1", regex=True)

0     smith,john
1    smith, john
2     brown, bob
3     brown, bob
Name: names, dtype: object
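
To see what each part of that pattern does, here is a minimal sketch that applies the same idea to the sample names with only the standard library `re` module. It uses `\S*` as a shorthand for `[^\s]*`; the variable names `pattern` and `cleaned` are just for illustration.

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Group 1 captures: everything before the comma, the comma itself,
# any whitespace after it, and the first run of non-whitespace characters.
# The trailing .* consumes the rest of the string, and the replacement
# keeps only group 1.
pattern = re.compile(r"^([^,]*,\s*\S*).*")

cleaned = [pattern.sub(r"\1", name) for name in names]
print(cleaned)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```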

Upvotes: 1

Chris Maurer

Reputation: 2600

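Try re.sub(r'(,\s*\w+).*$', r'\1', str(x)). In Python the pattern is a plain string, so it takes no /.../ delimiters (the slashes would be matched literally), and group 1 is written \1 in the replacement, not $1.

This puts the part you want to keep (the comma, any whitespace, and the first word) into capture group 1 and restores it in the replacement, so only the text after that word is removed.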
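
A quick way to check the capture-group approach on the sample names is sketched below with the standard library only. It also runs a slash-delimited pattern for comparison, on the assumption that this is why the original attempt left the names unchanged; the variable names are illustrative.

```python
import re

names = ['smith,john s', 'smith, john', 'brown, bob s', 'brown, bob']

# Slash-delimited pattern: Python treats "/" as a literal character,
# so this never matches and every name comes back unchanged.
with_slashes = [re.sub(r'/(,\s*\w+).*$/', r'\1', n) for n in names]

# No delimiters, and \1 as the backreference: group 1 (comma, optional
# whitespace, first word) is kept and the rest of the string is dropped.
fixed = [re.sub(r'(,\s*\w+).*$', r'\1', n) for n in names]

print(with_slashes == names)  # True
print(fixed)
# ['smith,john', 'smith, john', 'brown, bob', 'brown, bob']
```

Note that `\w+` stops at hyphens and apostrophes, so a first name such as "mary-ann" would be cut to "mary"; `\S+` may be a better fit if the data contains such names.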

Upvotes: 0
