Reputation: 53
I have two dataframes:
df1
name
ADAM, HAFIZ M
ABAD, FARLEY J
CORDDED, NANCY C
BOMBSHAD, WANG D
df2
JOSEPH W. HOLUBKA
WANG E. JONATHAN
CUCU F. LIU,
WANG C. DANA,
LANDY F. JON
I am hoping to extract the first name from each dataframe. For df1, I need the "first name" portion after the ","; for df2, the first word is the first name I want.
So the returned dataframes should be:
df1
HAFIZ
FARLEY
NANCY
WANG
df2
JOSEPH
WANG
CUCU
WANG
LANDY
My current code is:
df['name'].str.upper().apply(lambda name:re.search(r'\w+(?!.*,)',name).group())
This regex works for both dataframes. However, I just realized my data has an entry error: in df2, LIU and DANA have a "," at the end, which causes the regex to stop working.
The error is that group() is not an attribute (the search returns None for those rows).
Is there any way I could fix this code? The regex should work for both dataframes.
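The failure can be reproduced outside Pandas. The following is a minimal sketch using only the standard `re` module and two of the sample names above; `first_name_or_none` is a hypothetical helper added for illustration, not part of the original code. It guards against the missing match instead of calling `group()` directly.

```python
import re

# The pattern from the question: a word that has no comma anywhere after it.
PATTERN = re.compile(r'\w+(?!.*,)')

def first_name_or_none(name):
    """Return the matched word, or None when the pattern finds nothing."""
    match = PATTERN.search(name.upper())
    # re.search returns None on failure, so calling .group() unguarded
    # raises AttributeError for rows the pattern cannot match.
    return match.group() if match else None

clean = first_name_or_none("ADAM, HAFIZ M")    # comma sits before the first name
trailing = first_name_or_none("CUCU F. LIU,")  # stray comma at the end of the row
```

With a trailing comma, every word in the row is followed by a comma, so the negative lookahead rejects all of them and `trailing` ends up as `None`.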
Upvotes: 0
Views: 56
Reputation: 627110
You can use
(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)
See the regex demo. This pattern matches the name at the start of the string when the string contains no comma other than an optional trailing one; otherwise it matches the name that follows a comma and a space.
Use it in Pandas with the vectorized Series.str.extract method:
df['first name'] = df['name'].str.upper().str.extract(r"(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)", expand=False)
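As a quick check without Pandas, the same pattern can be run through the standard `re` module against the sample names from the question. This is only a sketch: `extract_first_name` and the two lists are hypothetical names introduced here, and the function mimics per row what `str.extract` does for a single capture group.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(r"(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)")

def extract_first_name(name):
    """Return capture group 1 for one name, or None if nothing matches."""
    match = PATTERN.search(name.upper())
    return match.group(1) if match else None

# Sample rows from the question, including the two with a trailing comma.
df1_names = ["ADAM, HAFIZ M", "ABAD, FARLEY J", "CORDDED, NANCY C", "BOMBSHAD, WANG D"]
df2_names = ["JOSEPH W. HOLUBKA", "WANG E. JONATHAN", "CUCU F. LIU,",
             "WANG C. DANA,", "LANDY F. JON"]

df1_first = [extract_first_name(n) for n in df1_names]
df2_first = [extract_first_name(n) for n in df2_names]
```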
Regex details
^(?=[^,]*,?$)[\w'-]+ - one or more word, ' and - chars ([\w'-]+) at the start of the string (^) if the string has no commas but may end with an optional comma ((?=[^,]*,?$))
| - or
(?<=, )[\w'-]+ - one or more word, ' and - chars preceded with comma + space.
Upvotes: 1
Reputation: 42228
Edit: Trying this again because my first one wasn't all there. You can take the regex from this excellent answer and you only need to change one thing. Where their lookahead matches any comma, we only want to match a comma that is followed by another word. That results in:
(?:(?<=^(?!.*, *\w))|(?<=, ))([A-Z]+)
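A small sketch of how this pattern might be checked with the standard `re` module. It assumes the names have already been upper-cased, since `[A-Z]+` only matches capital letters; `first_name` is a hypothetical helper name, not something from the answer.

```python
import re

# Pattern copied from the answer above; the first name is capture group 1.
PATTERN = re.compile(r"(?:(?<=^(?!.*, *\w))|(?<=, ))([A-Z]+)")

def first_name(name):
    """Return capture group 1 for an upper-cased name, or None if no match."""
    match = PATTERN.search(name.upper())
    return match.group(1) if match else None

# One row of each shape: "LAST, FIRST M", "FIRST M. LAST" and "FIRST M. LAST,".
comma_first = first_name("ABAD, FARLEY J")
no_comma = first_name("LANDY F. JON")
trailing_comma = first_name("WANG C. DANA,")
```

The trailing comma no longer blocks the match at the start of the string, because the negative lookahead only rejects a comma that has another word after it.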
Upvotes: 0