bellavera

Reputation: 23

How do I remove part of a string that follows a certain pattern up to, but not including another pattern using R?

I have a dataframe in R that contains people data. The first part of each string is a full name. Every so often I encounter a nickname in brackets. There can be other data enclosed in brackets that I do not want to delete. Here is an example of the kind of data I am working with:

Name <- c(
    "JOSEPH RYAN SMITH (USRID1)",
    "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
    "TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
    "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
    "WILLIAM (BILLIE) JOEL (USRID5)")
df <- as.data.frame(Name)

I get:

                                         Name
1                   JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3      TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5               WILLIAM (BILLIE) JOEL (USRID5)

I only want to remove nicknames. I noticed that what sets a nickname apart is that it is always in brackets and is always followed by a last name. All other indicators included in brackets are followed by " (" or end of record. I tried to remove a string that is in brackets that is followed by a space and a character A-Z.

df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][A-Z]")

This removed the first letter of the last name and gave me:

                                          Name
1                   JOSEPH RYAN SMITH (USRID1)
2 ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)
3             TIMOTHY OHNSON (USRID3) (INTERN)
4 JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)
5                         WILLIAM OEL (USRID5)

I also unsuccessfully tried "not followed by (" like this:

df$Name <- str_remove(df$Name, "[\\(][A-Z]+[\\)][ ][^\\(]")

I tried a few other things which removed other indicators that are in brackets that I do need to keep. Any help is appreciated. Thank you.

Upvotes: 2

Views: 251

Answers (1)

Ronak Shah

Reputation: 388982

Use a positive lookahead, (?=...), so that the first letter of the last name has to be present but is not part of the match, and therefore is not removed.

stringr::str_remove(df$Name, "\\([A-Z]+\\)\\s(?=[A-Z])")

#[1] "JOSEPH RYAN SMITH (USRID1)"                  
#[2] "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)"
#[3] "TIMOTHY JOHNSON (USRID3) (INTERN)"           
#[4] "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)"
#[5] "WILLIAM JOEL (USRID5)" 
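
To see what the lookahead changes, it can help to print the text each pattern actually matches. This is only a small sketch in base R; the variable names x, consuming and lookahead are made up for the illustration.

```r
# One sample record from the question
x <- "TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)"

# Without a lookahead, the trailing [A-Z] is part of the match,
# so removing the match also removes the "J"
consuming <- regmatches(x, regexpr("\\([A-Z]+\\)\\s[A-Z]", x, perl = TRUE))

# With a lookahead, the letter must follow but stays outside the match
lookahead <- regmatches(x, regexpr("\\([A-Z]+\\)\\s(?=[A-Z])", x, perl = TRUE))

print(consuming)  # "(TIM) J"
print(lookahead)  # "(TIM) "
```

Note that "(USRID3)" is never a candidate here, because [A-Z]+ does not match the digit.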

You can also write this in base R with sub. Here perl = TRUE is required, because R's default regex engine does not support lookaheads:

sub('\\([A-Z]+\\)\\s(?=[A-Z])', '', df$Name, perl = TRUE)
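
If you would rather avoid perl = TRUE, one alternative (a different technique, not the lookahead above) is to match the following letter in a capture group and put it back in the replacement. A minimal sketch on the sample data from the question; with_lookahead and with_capture are names I made up:

```r
Name <- c(
    "JOSEPH RYAN SMITH (USRID1)",
    "ANDREA J LOPEZ RAMIREZ (USRID2) (CONTRACTOR)",
    "TIMOTHY (TIM) JOHNSON (USRID3) (INTERN)",
    "JESSICA JENNIFER JONES (USRID4) (CONTRACTOR)",
    "WILLIAM (BILLIE) JOEL (USRID5)")

# Lookahead version: the letter is checked but never consumed
with_lookahead <- sub("\\([A-Z]+\\)\\s(?=[A-Z])", "", Name, perl = TRUE)

# Capture-group version: the letter is consumed, then restored via \\1
with_capture <- sub("\\([A-Z]+\\)\\s([A-Z])", "\\1", Name)

# Both approaches should agree on this data
print(identical(with_lookahead, with_capture))
print(with_capture)
```

sub replaces only the first match in each string. If a name could contain more than one nickname, gsub takes the same arguments and replaces every match.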

Upvotes: 3
