Reputation: 23
I have a data frame in R that contains people data. The first part of each string is a full name. Every so often I encounter a nickname in brackets. There may be other data enclosed in brackets that I do not want to delete. Here is an example of the kind of data I am working with:
Name <- c(
"JOSEPH RYAN SMITH (USRID1)",
"ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
"TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
"JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
"WILLIAM (BILLIE) JOEL (USRID5)")
df <- as.data.frame(Name)
I get:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM (BILLIE) JOEL (USRID5)
I only want to remove nicknames. I noticed that what sets a nickname apart is that it is always in brackets and is always followed by a last name. All other indicators included in brackets are followed by " (" or end of record. I tried to remove a string that is in brackets that is followed by a space and a character A-Z.
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][A-Z]")
This removed the first letter of the last name and gave me:
Name
1 JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3 TIMOTHY OHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5 WILLIAM OEL (USRID5)
I also unsuccessfully tried "not followed by (" like this:
df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][^\\(]")
I tried a few other things which removed other indicators that are in brackets that I do need to keep. Any help is appreciated. Thank you.
Upvotes: 2
Views: 251
Reputation: 388982
Use a positive lookahead, (?=...), so that the first letter of the last name has to be present but is not part of the match, and therefore is not removed.
stringr::str_remove(df$Name, "\\([A-Z]+\\)\\s(?=[A-Z])")
#[1] "JOSEPH RYAN SMITH (USRID1)"
#[2] "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)"
#[3] "TIMOTHY JOHNSON (USRID3) (INTERN)"
#[4] "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)"
#[5] "WILLIAM JOEL (USRID5)"
You can also write this in base R with sub. Note perl = TRUE, which is needed because R's default regex engine does not support lookaheads:
sub('\\([A-Z]+\\)\\s(?=[A-Z])', '', df$Name, perl = TRUE)
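To see why the lookahead helps, here is a minimal base R sketch using the sample data from the question. It prints what the pattern actually matches and also tries a negative lookahead, (?!\\(), as one possible way to express the "not followed by (" idea from the question. The variable names (pattern, matched, cleaned, cleaned_neg) are made up for illustration, and the sketch assumes nicknames contain only the letters A-Z.

```r
Name <- c(
  "JOSEPH RYAN SMITH (USRID1)",
  "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
  "TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
  "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
  "WILLIAM (BILLIE) JOEL (USRID5)")

# Positive lookahead: a letter must follow the space, but it is not consumed
pattern <- "\\([A-Z]+\\)\\s(?=[A-Z])"

# Show the matched text; the first letter of the last name is not included
matched <- regmatches(Name, regexpr(pattern, Name, perl = TRUE))
print(matched)

# Remove the first match in each string
cleaned <- sub(pattern, "", Name, perl = TRUE)

# Negative lookahead: the space must NOT be followed by an opening bracket
cleaned_neg <- sub("\\([A-Z]+\\) (?!\\()", "", Name, perl = TRUE)

# On this sample both patterns should give the same result
print(identical(cleaned, cleaned_neg))
print(cleaned)
```

The attempts in the question used [A-Z] and [^\\(] as ordinary character classes, so the character after the space became part of the match and was deleted along with the nickname. A lookahead only checks that character without consuming it.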
Upvotes: 3