Reputation: 53

Regex Text Cleaning on Multiple forms of text formats

I have a dataframe with multiple forms of names:

JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R

For the first three, the last name is the last word. For the last three, or anything with a comma, the last name is the part before the comma. However, for a name like ABDOU TCHOUSNOU, I only want the last word of that part, which is TCHOUSNOU.

The expected output is

JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI

The first list contains the names, and the second list is what I want to return for each of them, in the same order. This is my current attempt:

str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)

Is there any way to solve this? My current code returns the first name instead of the surname, which is what I want.
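To illustrate the problem, here is a minimal sketch that applies the pattern above to the sample names with the standard `re` module. It assumes `re.search` is a fair stand-in for what `str.extract` does on each row (taking the first match); the `names` list and the `matched` variable are only illustrative.

```python
import re

# Pattern copied from the attempt above.
pattern = re.compile(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)')

# Sample names from the question.
names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

# Take the first match per name, as a per-row extract would.
matched = [pattern.search(n).group(1) for n in names]
print(matched)
# ['JOSEPH', 'Ralph', 'RAYMOND', 'SAMIR', 'BOUBACAR', 'OMAR']
```

The first alternative anchors at the start of a comma-free line and the second one takes the word after ", ", so both branches pick up a first name.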

Upvotes: 0

Answers (2)

MonkeyZeus

Reputation: 20737

Something like this would work:

(.+(?=,)|\S+$)

( - start capture group #1
.+(?=,) - get everything before a comma
| - or
\S+$ - get everything which is not a whitespace before the end of the line
) - end capture group #1

https://regex101.com/r/myvyS0/1

Python:

str.extract(r'(.+(?=,)|\S+$)', expand=False)
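As a quick check, the following sketch runs the same pattern over the sample names from the question using the standard `re` module. It assumes `re.search` behaves like `str.extract` does per row (first match only); the `names` and `surnames` variables are illustrative.

```python
import re

# Same pattern as above: everything before a comma, or the last run of
# non-whitespace characters before the end of the string.
pattern = re.compile(r'(.+(?=,)|\S+$)')

# Sample names from the question.
names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

surnames = [pattern.search(n).group(1) for n in names]
print(surnames)
# ['JASON', 'Landau', 'ADAMS', 'ABD', 'ABDOU TCHOUSNOU', 'ABDL-ALI']
```

Note that everything before the comma is kept, so the fifth name yields ABDOU TCHOUSNOU rather than only TCHOUSNOU, and the original letter case is preserved.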

Upvotes: 1

anubhava

Reputation: 785226

You may use this regex to extract:

>>> print (df)
                        name
0            JOSEPH W. JASON
1               Ralph Landau
2            RAYMOND C ADAMS
3                 ABD, SAMIR
4  ABDOU TCHOUSNOU, BOUBACAR
5           ABDL-ALI, OMAR R

>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0            JASON
1           Landau
2            ADAMS
3              ABD
4  ABDOU TCHOUSNOU
5         ABDL-ALI

RegEx Details:

(: Start capture group
- [^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
- |: OR
- \w+: Match 1+ word characters
- (?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
- (?=$): Lookahead to assert that we are at the end of the line
): End capture group
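If only the last word before the comma is wanted (TCHOUSNOU rather than ABDOU TCHOUSNOU) and the result should be upper-cased as in the expected output, a variant along these lines might work. This is a sketch and not part of the regex above: the pattern, the `LAST_NAME` constant, and the `last_name` helper are assumptions made for illustration, and it assumes the surname is always the single word directly before the comma or at the end of the string.

```python
import re

# Assumed variant: one word (hyphens allowed) that is immediately
# followed by a comma or by the end of the string.
LAST_NAME = re.compile(r'[\w-]+(?=,|$)')


def last_name(name: str) -> str:
    """Return the upper-cased surname, or '' if nothing matches."""
    # strip() guards against trailing whitespace, which would otherwise
    # keep the last word from touching the end of the string.
    match = LAST_NAME.search(name.strip())
    return match.group(0).upper() if match else ''


# Sample names from the question.
names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]

result = [last_name(n) for n in names]
print(result)
# ['JASON', 'LANDAU', 'ADAMS', 'ABD', 'TCHOUSNOU', 'ABDL-ALI']
```

With pandas, the same idea would presumably be `df['name'].str.extract(r'([\w-]+(?=,|$))', expand=False).str.upper()`, since the search stops at the first word that satisfies the lookahead.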

Upvotes: 0
