LearningCode

Reputation: 53

Regex Text Cleaning on Multiple forms of text formats

I have a dataframe with multiple forms of names:

JOSEPH W. JASON
Ralph Landau
RAYMOND C ADAMS
ABD, SAMIR
ABDOU TCHOUSNOU, BOUBACAR
ABDL-ALI, OMAR R

For the first three, the rule is to take the last word. For the last three, or anything with a comma, the last name is the part before the comma. However, for a name like ABDOU TCHOUSNOU, I only want the last word of that part, which is TCHOUSNOU.

The expected output is

JASON
LANDAU
ADAMS
ABD
TCHOUSNOU
ABDL-ALI

The first list is the names, and the second list is what I want to return.

str.extract(r'(^(?=[^,]*,?$)[\w-]+|(?<=, )[\w-]+)', expand=False)

Is there any way to solve this? The current code returns only the first name instead of the surname, which is what I want.

Upvotes: 0

Views: 73

Answers (2)

MonkeyZeus

Reputation: 20737

Something like this would work:

(.+(?=,)|\S+$)
  • ( - start capture group #1
  • .+(?=,) - get everything before a comma
  • | - or
  • \S+$ - get the run of non-whitespace characters right before the end of the line
  • ) - end capture group #1

https://regex101.com/r/myvyS0/1

Python:

str.extract(r'(.+(?=,)|\S+$)', expand=False)
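As a quick check, here is a minimal sketch that runs the same pattern over the sample names with the standard `re` module instead of pandas. The names `PATTERN`, `names` and `extract_surname` are made up for illustration. It relies on `str.extract` taking the first match per string, which is what `re.search` does here. Note that for `ABDOU TCHOUSNOU, BOUBACAR` the pattern returns everything before the comma, not just `TCHOUSNOU`.

```python
import re

# Pattern from this answer: everything before a comma, or the last
# run of non-whitespace characters in the string.
PATTERN = re.compile(r'(.+(?=,)|\S+$)')

# Sample names copied from the question.
names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]


def extract_surname(name):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = PATTERN.search(name)
    return match.group(1) if match else None


surnames = [extract_surname(n) for n in names]
print(surnames)
# ['JASON', 'Landau', 'ADAMS', 'ABD', 'ABDOU TCHOUSNOU', 'ABDL-ALI']
```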

Upvotes: 1

anubhava

Reputation: 785226

You may use this regex to extract:

>>> print (df)
                       name
0           JOSEPH W. JASON
1              Ralph Landau
2           RAYMOND C ADAMS
3                ABD, SAMIR
4  ABDOU TCHOUSNOU, BOUBACA
5          ABDL-ALI, OMAR R

>>> df['name'].str.extract(r'([^,]+(?=,)|\w+(?:-\w+)*(?=$))', expand=False)
0            JASON
1           Landau
2            ADAMS
3              ABD
4  ABDOU TCHOUSNOU
5         ABDL-ALI

RegEx Details:

  • (: Start capture group
    • [^,]+(?=,): Match 1+ non-comma characters that are followed by a comma
    • |: OR
    • \w+: Match 1+ word characters
    • (?:-\w+)*: Match - followed by 1+ word characters. Match 0 or more of this group
    • (?=$): Lookahead to assert that we have end of line ahead
  • ): End capture group
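If only the last word before the comma is wanted (`TCHOUSNOU` rather than `ABDOU TCHOUSNOU`) and the result should be upper-cased as in the expected output, one possible variation is to match a single word that is directly followed by a comma or the end of the string. The sketch below is an assumption about what the question is after, not part of the regex above. It uses the standard `re` module, and `LAST_WORD`, `names` and `surname` are hypothetical names.

```python
import re

# A single word (letters, digits, underscore, hyphen) that is directly
# followed by a comma or by the end of the string.
LAST_WORD = re.compile(r'([\w-]+)(?=,|$)')

# Sample names copied from the question.
names = [
    "JOSEPH W. JASON",
    "Ralph Landau",
    "RAYMOND C ADAMS",
    "ABD, SAMIR",
    "ABDOU TCHOUSNOU, BOUBACAR",
    "ABDL-ALI, OMAR R",
]


def surname(name):
    """Return the upper-cased surname, or None if no word qualifies."""
    # strip() so trailing spaces do not hide the end-of-string anchor.
    match = LAST_WORD.search(name.strip())
    return match.group(1).upper() if match else None


print([surname(n) for n in names])
# ['JASON', 'LANDAU', 'ADAMS', 'ABD', 'TCHOUSNOU', 'ABDL-ALI']

# Assumed pandas equivalent (same pattern, first match per row):
# df['name'].str.strip().str.extract(r'([\w-]+)(?=,|$)', expand=False).str.upper()
```

The first match is the one wanted because the search scans left to right: in a name with a comma, the earliest word that touches the comma is found before anything after it, and in a name without a comma only the final word touches the end of the string.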

Upvotes: 0
