Canovice

Reputation: 10491

In R, remove substring pattern from string with gsub

We have a string column in our database with values for sports teams. The names of these teams are occasionally prefixed with the team's ranking, like so: (13) Miami (FL). Here the 13 is Miami's rank, and the (FL) means this is Miami of Florida, not Miami of Ohio (Miami (OH)).

We need to clean up this string, removing (13) and keeping only Miami (FL). So far we've used gsub and tried the following:

> gsub("\\s*\\([^\\)]+\\)", "", "(13) Miami (FL)")
[1] " Miami"

This incorrectly removes the (FL) suffix, and it also leaves a leading space in front of the name.

Edit

Here are a few additional school names to show the data we're working with. Note that not every school has the (##) prefix:

c("North Texas", "Southern Methodist", "Texas-El Paso", 
  "Brigham Young", "Winner", "(12) Miami (FL)", "Appalachian State", 
  "Arkansas State", "Army", "(1) Clemson", 
  "(14) Georgia Southern")

Upvotes: 0

Views: 65

Answers (3)

Chris Ruehlemann

Reputation: 21440

Another solution, based on stringr, is this:

library(stringr)

# v1 is the character vector of school names from the question
str_extract(v1, "[A-Z].*")
 [1] "North Texas"        "Southern Methodist" "Texas-El Paso"      "Brigham Young"      "Winner"            
 [6] "Miami (FL)"         "Appalachian State"  "Arkansas State"     "Army"               "Clemson"           
[11] "Georgia Southern"

This extracts everything from the first uppercase letter onward, thereby skipping the unwanted rankings. It assumes every school name begins with an uppercase letter.
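To make that assumption concrete, here is a minimal base R sketch of the same "start at the first uppercase letter" idea. The vector x and its last entry are invented for illustration; a name with no uppercase letter yields NA, which is also what str_extract returns when nothing matches.

```r
# Hypothetical input; "(3) 49ers" is made up to show the no-match case.
x <- c("(12) Miami (FL)", "Army", "(3) 49ers")

# regexpr() gives the position of the first match per element, or -1 if none.
m <- regexpr("[A-Z].*", x)
hit <- as.integer(m) != -1L

# regmatches() returns only the elements that matched, so fill the rest with NA.
out <- rep(NA_character_, length(x))
out[hit] <- regmatches(x, m)
print(out)
```

If names without a leading uppercase letter can occur in the data, removing the rank explicitly may be the safer choice.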

Upvotes: 0

akrun

Reputation: 887971

We can match the opening ( followed by one or more digits (\\d+), then the closing ) and one or more spaces (\\s+), and replace the match with an empty string ("").

sub("\\(\\d+\\)\\s+", "",  "(13) Miami (FL)")
#[1] "Miami (FL)"
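To see why the attempt in the question also dropped the suffix, the following sketch lists what each pattern actually matches, using base R's gregexpr and regmatches. The variable names are only for illustration.

```r
x <- "(13) Miami (FL)"

# Pattern from the question: any parenthesised group, with optional leading
# whitespace. With gsub, every one of these matches is removed.
question_hits <- regmatches(x, gregexpr("\\s*\\([^\\)]+\\)", x))[[1]]
print(question_hits)

# Pattern used here: only parentheses that contain digits, plus the
# whitespace that follows them.
rank_hits <- regmatches(x, gregexpr("\\(\\d+\\)\\s+", x))[[1]]
print(rank_hits)
```

The first pattern matches both "(13)" and " (FL)", while the second matches only "(13) ", trailing space included.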

Using the OP's updated example (stored as v1):

sub("\\(\\d+\\)\\s+", "",  v1)
#[1] "North Texas"        "Southern Methodist" "Texas-El Paso"      "Brigham Young"      "Winner"             "Miami (FL)"        
#[7] "Appalachian State"  "Arkansas State"     "Army"               "Clemson"            "Georgia Southern"  

Or another option, with str_remove from stringr:

library(stringr)
str_remove("(13) Miami (FL)", "\\(\\d+\\)\\s+")
#[1] "Miami (FL)"

Upvotes: 1

Ronak Shah

Reputation: 389325

You can use sub to remove a number in parentheses followed by a single whitespace character.

sub("\\(\\d+\\)\\s", "", "(13) Miami (FL)")
#[1] "Miami (FL)"

The regex could be made stricter based on the pattern in the data.
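As one possible stricter variant, assuming the rank only ever appears at the very start of the string, the pattern can be anchored with ^. The vector teams and its last entry are made up to show the difference.

```r
# Hypothetical sample; "Team (7) Reserves" is invented to show the effect of anchoring.
teams <- c("(13) Miami (FL)", "(1) Clemson", "Army", "Team (7) Reserves")

# ^ restricts the match to the start of the string, so only a leading
# "(digits)" plus the whitespace after it is removed.
cleaned <- sub("^\\(\\d+\\)\\s+", "", teams)
print(cleaned)
```

Without the anchor, the number in the middle of the last entry would be removed as well.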

Upvotes: 1
