Reputation: 53
I have a dataframe with multiple forms of names:
JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R
For the first three, the last name is the last word. For the last three, or anything with a comma, the first word is the last name. However, for a name like ABDOU TCHOUSNOU, I only take the last word before the comma, which is TCHOUSNOU.
The expected output is
JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI
The first list is the name, and the second list is what I want to return. My current code is:
str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)
Is there any way to solve this? The current code only returns the first name instead of the surname, which is what I want.
Upvotes: 0
Views: 73
Reputation: 20737
Something like this would work:
(.+(?=,)|\S+$)
( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - get everything which is not a whitespace before the end of the line
) - end capture group #1
https://regex101.com/r/myvyS0/1
Python:
str.extract(r'(.+(?=,)|\S+$)', expand=False)
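To see what this pattern does with the names from the question, here is a minimal sketch using the standard-library re module, on the assumption that re.search picks the same first match as pandas' str.extract. The helper name extract_surname is made up for illustration.

```python
import re

# Pattern from the answer above: everything before a comma, or the last
# run of non-whitespace characters at the end of the string.
PATTERN = re.compile(r'(.+(?=,)|\S+$)')

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

def extract_surname(name):
    """Return the first match of PATTERN in name, or None if nothing matches."""
    match = PATTERN.search(name)
    return match.group(1) if match else None

surnames = [extract_surname(n) for n in names]
print(surnames)
# ['JASON', 'Landau', 'ADAMS', 'ABD', 'ABDOU TCHOUSNOU', 'ABDL-ALI']
```

Note that this keeps the whole part before the comma, so the fifth name comes back as ABDOU TCHOUSNOU rather than TCHOUSNOU, and the original casing is preserved.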
Upvotes: 1
Reputation: 785226
You may use this regex to extract:
>>> print (df)
name
0 JOSEPH W. JASON
1 Ralph Landau
2 RAYMOND C ADAMS
3 ABD, SAMIR
4 ABDOU TCHOUSNOU, BOUBACA
5 ABDL-ALI, OMAR R
>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0 JASON
1 Landau
2 ADAMS
3 ABD
4 ABDOU TCHOUSNOU
5 ABDL-ALI
RegEx Details:
( : Start capture group
[^,]+(?=,) : Match 1+ non-comma characters that are followed by a comma
| : OR
\w+ : Match 1+ word characters
(?:-\w+)* : Match - followed by 1+ word characters. Match 0 or more of this group
(?=$) : Lookahead to assert that we have end of line ahead
) : End capture group
Upvotes: 0
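The output above keeps ABDOU TCHOUSNOU whole, while the question asks for TCHOUSNOU in upper case. One possible variant, not taken from either answer, is to match only the word (letters, digits, underscores, hyphens) directly before a comma, or the last word of the string when there is no comma. This is a minimal sketch with the standard-library re module; the names LAST_WORD and last_name are hypothetical.

```python
import re

# Assumption: a "word" is a run of \w characters and hyphens.
# First alternative: the word directly followed by a comma.
# Second alternative: the word at the very end of the string.
LAST_WORD = re.compile(r'([\w-]+(?=,)|[\w-]+$)')

def last_name(name):
    """Return the upper-cased surname, or None if no word is found."""
    match = LAST_WORD.search(name)
    return match.group(1).upper() if match else None

names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

result = [last_name(n) for n in names]
print(result)
# ['JASON', 'LANDAU', 'ADAMS', 'ABD', 'TCHOUSNOU', 'ABDL-ALI']
```

With pandas, the same pattern should work as df['name'].str.extract(r'([\w-]+(?=,)|[\w-]+$)', expand=False).str.upper(), though it will split surnames that contain other punctuation such as apostrophes.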