Regex code: how to solve a data entry error

Question

I have two dataframes:

df1

name
ADAM, HAFIZ M
ABAD, FARLEY J
CORDDED, NANCY C
BOMBSHAD, WANG D

df2
JOSEPH W. HOLUBKA   
WANG E. JONATHAN
CUCU F. LIU,
WANG C. DANA,
LANDY F. JON

I am hoping to extract the first name from each dataframe. For df1, I need the "first name" portion after the ","; for df2, the first word is the first name I want.

So the returned df should be:

df1
HAFIZ
FARLEY
NANCY
WANG

df2
JOSEPH
WANG
CUCU
WANG
LANDY

My current code is:

  df['name'].str.upper().apply(lambda name:re.search(r'\w+(?!.*,)',name).group())

This regex works for both dataframes. However, I just realized my data has an entry error: in df2, Liu and Dana have a "," at the end, which causes the regex to stop working.

The error says that group() is not an attribute, because re.search returns None for those rows.

Is there any way I could fix this code? The regex should work for both dataframes.

Wiktor Stribiżew · Accepted Answer

You can use

(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)

See the regex demo. This pattern also matches a name at the start of the string when the string contains no comma or only a trailing comma.

Use it in pandas with the vectorized Series.str.extract method:

df['first name'] = df['name'].str.upper().str.extract(r"(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)", expand=False)
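
As a quick check, here is a minimal sketch that applies the line above to the sample rows from the question. The dataframe names df1 and df2 come from the question; the constant FIRST_NAME and the "first name" column are just illustrative choices, and the rows are typed in by hand rather than loaded from real data.

```python
import pandas as pd

# Pattern from the answer: a leading name when the string has no comma
# (or only a trailing one), otherwise the name that follows ", ".
FIRST_NAME = r"(^(?=[^,]*,?$)[\w'-]+|(?<=, )[\w'-]+)"

# Sample rows copied from the question (assumed to live in a "name" column).
df1 = pd.DataFrame({"name": ["ADAM, HAFIZ M", "ABAD, FARLEY J",
                             "CORDDED, NANCY C", "BOMBSHAD, WANG D"]})
df2 = pd.DataFrame({"name": ["JOSEPH W. HOLUBKA   ", "WANG E. JONATHAN",
                             "CUCU F. LIU,", "WANG C. DANA,", "LANDY F. JON"]})

# The same expression is applied to both frames.
for df in (df1, df2):
    df["first name"] = df["name"].str.upper().str.extract(FIRST_NAME, expand=False)

print(df1["first name"].tolist())
print(df2["first name"].tolist())
```

Unlike re.search(...).group(), str.extract does not raise when a row fails to match; it leaves NaN in that row, so malformed entries are easy to find afterwards with isna().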

Regex details

^(?=[^,]*,?$)[\w'-]+ - one or more word, ' and - chars ([\w'-]+) at the start of the string (^), provided the string contains no commas other than an optional trailing one ((?=[^,]*,?$))
| - or
(?<=, )[\w'-]+ - one or more word, ' and - chars preceded by a comma + space.
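
To see which alternative fires for a given string, the two branches can be tried separately with the standard re module. This is only an illustrative sketch: the helper which_branch and the names LEADING, AFTER_COMMA and ORIGINAL are made up here, and the last lines show why the pattern from the question ends in an error on rows with a trailing comma.

```python
import re

# The two alternatives from the answer, compiled separately for illustration.
LEADING = re.compile(r"^(?=[^,]*,?$)[\w'-]+")  # name at start; no comma except an optional final one
AFTER_COMMA = re.compile(r"(?<=, )[\w'-]+")    # name right after ", "
ORIGINAL = re.compile(r"\w+(?!.*,)")           # pattern from the question

def which_branch(name):
    """Return (branch label, matched text), or (None, None) if neither branch matches."""
    m = LEADING.search(name)
    if m:
        return "leading", m.group()
    m = AFTER_COMMA.search(name)
    if m:
        return "after comma", m.group()
    return None, None

samples = ["ADAM, HAFIZ M", "JOSEPH W. HOLUBKA", "CUCU F. LIU,"]
results = {s: which_branch(s) for s in samples}
for s, r in results.items():
    print(repr(s), "->", r)

# With a trailing comma, every word is still followed by a comma somewhere,
# so the question's pattern finds nothing: re.search returns None, and
# calling .group() on None raises AttributeError.
original_match = ORIGINAL.search("CUCU F. LIU,")
print(original_match)
```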
