LearningCode

Reputation: 53

Regular expression to clean up names

I have two dataframes of names. The real dataframes are longer, but I am using the top three rows of each as examples.

First list name examples: 
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li

Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J

I need to extract the first name from both dataframes. How can I extract them with one regular expression?

The result should be:
list 1
JOSEPH
MIMI
WANG

list 2
DENNIS
LINDA
FARLEY

My current code is re.search(r'(?<=,)\w+', df['name']), but it doesn't return the right names. Is it possible to write one or two regular expressions in Python to extract those names?

Upvotes: 1

Views: 423

Answers (2)

Ryszard Czech

Reputation: 18611

Use

df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)

See proof

Explanation

--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ^                        the beginning of the string
--------------------------------------------------------------------------------
      (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
        .*                       any character except \n (0 or more
                                 times (matching the most amount
                                 possible))
--------------------------------------------------------------------------------
        ,                        ','
--------------------------------------------------------------------------------
      )                        end of look-ahead
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    (?<=                     look behind to see if there is:
--------------------------------------------------------------------------------
      ,                        ', '
--------------------------------------------------------------------------------
    )                        end of look-behind
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  (                        group and capture to \1:
--------------------------------------------------------------------------------
    [A-Z]+                   any character of: 'A' to 'Z' (1 or more
                             times (matching the most amount
                             possible))
--------------------------------------------------------------------------------
  )                        end of \1
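
As a quick sanity check, the pattern can also be tried with plain `re` on the six sample names from the question, without pandas. This is only a sketch: the helper name `first_name` and the `names` list are illustrative and not part of the original answer.

```python
import re

# Same pattern as above: the first run of capitals either at the start of a
# comma-free string, or immediately after ", ".
PATTERN = re.compile(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)')

def first_name(name):
    """Return the captured first name, or None if the pattern finds nothing."""
    match = PATTERN.search(name)
    return match.group(1) if match else None

# Sample names copied from the question (both formats).
names = [
    "JOSEPH W. JOHN", "MIMI N. ALFORD", "WANG E. Li",
    "AAMIR, DENNIS M", "MAHAMMED, LINDA X", "ABAD, FARLEY J",
]

results = [first_name(n) for n in names]
print(results)
```

Because the character class is `[A-Z]+`, a first name containing lowercase letters would be cut off at the first lowercase character.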

Upvotes: 1

alani

Reputation: 13079

It appears that what you want to look for here is the first sequence of word characters that does not have a comma anywhere after it on the line, rather than one that does have a comma before it. So instead of your positive look-behind assertion, it seems that you will want a negative look-ahead assertion.

Try using as your regex:

r'\w+(?!.*,)'
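
To illustrate the difference, here is a small comparison of the question's look-behind pattern and the look-ahead pattern on one name of each format. The helper `find` and the `comparison` dict are hypothetical names used only for this sketch.

```python
import re

lookbehind = re.compile(r'(?<=,)\w+')  # pattern from the question
lookahead = re.compile(r'\w+(?!.*,)')  # pattern suggested in this answer

def find(pattern, text):
    """Return the matched text, or None when there is no match."""
    match = pattern.search(text)
    return match.group() if match else None

# Map each sample name to (look-behind result, look-ahead result).
comparison = {
    name: (find(lookbehind, name), find(lookahead, name))
    for name in ("JOSEPH W. JOHN", "AAMIR, DENNIS M")
}
print(comparison)
```

The look-behind version finds nothing in either name: the first has no comma at all, and in the second the comma is followed by a space rather than a word character.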

Apply this using:

import re

df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())

Applying the above to this example dataframe:

                name   foo
0     JOSEPH W. JOHN     1
1     MIMI N. ALFORD     3
2         WANG E. Li     3
3    AAMIR, DENNIS M     3
4  MAHAMMED, LINDA X     3
5     ABAD, FARLEY J     3

gives:

0    JOSEPH
1      MIMI
2      WANG
3    DENNIS
4     LINDA
5    FARLEY
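
One caveat: `re.search` returns `None` when nothing matches (an empty string, for instance), so calling `.group()` directly would raise an `AttributeError` on such rows. A minimal sketch of a guarded helper, assuming unmatched rows should simply become `None`; `first_name_or_none` is a hypothetical name:

```python
import re

# First run of word characters with no comma anywhere after it.
WORD_NOT_BEFORE_COMMA = re.compile(r'\w+(?!.*,)')

def first_name_or_none(name):
    """Return the matched first name, or None when the pattern does not match."""
    match = WORD_NOT_BEFORE_COMMA.search(name)
    return match.group() if match else None

# Two names from the question plus two inputs with nothing to match.
samples = ["JOSEPH W. JOHN", "AAMIR, DENNIS M", "", "???"]
extracted = [first_name_or_none(s) for s in samples]
print(extracted)
```

With pandas, this helper could be passed in place of the lambda, as in `df['name'].apply(first_name_or_none)`.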

Upvotes: 1
