Reputation: 53
I have two dataframes of names. The dataframes are longer, but I am using the top 3 rows of each as examples.
First list name examples:
JOSEPH W. JOHN
MIMI N. ALFORD
WANG E. Li
Second list name examples:
AAMIR, DENNIS M
MAHAMMED, LINDA X
ABAD, FARLEY J
I need to extract the first name from both dataframes. How can I extract them with one regular expression?
The return should be
list 1
JOSEPH
MIMI
WANG
list 2
DENNIS
LINDA
FARLEY
My current code is re.search(r'(?<=,)\w+', df['name']), but it doesn't return the right names. Is it possible to write a regular expression in Python that extracts those names?
Upvotes: 1
Views: 423
Reputation: 18611
Use
df['First Name'] = df['name'].str.extract(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)', expand=False)
See proof
Explanation
(?:                  group, but do not capture:
  (?<=               look behind to see if there is:
    ^                the beginning of the string
    (?!              look ahead to see if there is not:
      .*             any character except \n (0 or more times, matching the most amount possible)
      ,              ','
    )                end of look-ahead
  )                  end of look-behind
 |                   OR
  (?<=               look behind to see if there is:
    ,                ', '
  )                  end of look-behind
)                    end of grouping
(                    group and capture to \1:
  [A-Z]+             any character of: 'A' to 'Z' (1 or more times, matching the most amount possible)
)                    end of \1
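As a quick sanity check, the pattern can be tried on the six sample names from the question using only the standard re module. This is a minimal sketch: the helper name extract_first and the names list are made up for illustration, and re.search(...).group(1) is used here as a stand-in for what str.extract returns for the first match on each row.

```python
import re

# Pattern from the answer above: a run of capital letters either at the
# start of a string that contains no comma, or directly after ", ".
PATTERN = re.compile(r'(?:(?<=^(?!.*,))|(?<=, ))([A-Z]+)')

# Sample names taken from the question (both formats).
names = [
    "JOSEPH W. JOHN",
    "MIMI N. ALFORD",
    "WANG E. Li",
    "AAMIR, DENNIS M",
    "MAHAMMED, LINDA X",
    "ABAD, FARLEY J",
]


def extract_first(name):
    """Hypothetical helper: capture group 1 of the first match, or None."""
    match = PATTERN.search(name)
    return match.group(1) if match else None


first_names = [extract_first(n) for n in names]
print(first_names)
```

Note that [A-Z]+ only matches uppercase ASCII letters, so a mixed-case first name such as "Li" would be cut off after its first letter.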
Upvotes: 1
Reputation: 13079
It appears that what you want to look for here is the first sequence of word characters that does not have a comma anywhere after it on the line, rather than one that does have a comma before it. So instead of your positive look-behind assertion, it seems that you will want a negative look-ahead assertion.
Try using as your regex:
r'\w+(?!.*,)'
Apply this (after import re) using:
df['name'].apply(lambda name: re.search(r'\w+(?!.*,)', name).group())
Applying the above to this example dataframe:
                name  foo
0     JOSEPH W. JOHN    1
1     MIMI N. ALFORD    3
2         WANG E. Li    3
3    AAMIR, DENNIS M    3
4  MAHAMMED, LINDA X    3
5     ABAD, FARLEY J    3
gives:
0 JOSEPH
1 MIMI
2 WANG
3 DENNIS
4 LINDA
5 FARLEY
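One caveat with calling .group() directly: re.search returns None when nothing matches, so a row that is empty or missing would raise an error. A possible guard is sketched below, assuming it is acceptable to get None back for such rows; the function name first_name is hypothetical and not part of the original answer.

```python
import re

# Same regex as above: a run of word characters with no comma anywhere
# after it on the line.
FIRST_NAME = re.compile(r'\w+(?!.*,)')


def first_name(name):
    """Return the first name, or None for non-strings and non-matching rows."""
    if not isinstance(name, str):
        # Covers missing values such as None or float NaN.
        return None
    match = FIRST_NAME.search(name)
    return match.group() if match else None


print(first_name("AAMIR, DENNIS M"))
print(first_name("JOSEPH W. JOHN"))
print(first_name(""))
```

It could then be applied in the same way as the lambda, with df['name'].apply(first_name).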
Upvotes: 1