Reputation: 10491
We have a string column in our database with values for sports teams. The names of these teams are occasionally prefixed with the team's ranking, like so: (13) Miami (FL). Here the 13 is Miami's rank, and the (FL) means this is Miami Florida, not Miami of Ohio (Miami (OH)).
We need to clean up this string, removing (13) and keeping only Miami (FL). So far we've used gsub and tried the following:
> gsub("\\s*\\([^\\)]+\\)", "", "(13) Miami (FL)")
[1] " Miami"
This is incorrectly removing the (FL) suffix, and it's also not handling the white space correctly in front.
Here are a few additional school names, to show a bit of the data we're working with. Note that not every school has the (##) prefix:
c("North Texas", "Southern Methodist", "Texas-El Paso",
"Brigham Young", "Winner", "(12) Miami (FL)", "Appalachian State",
"Arkansas State", "Army", "(1) Clemson",
"(14) Georgia Southern")
Upvotes: 0
Views: 65
Reputation: 21440
Another solution, based on stringr, is this (with v1 holding the vector of names from the question):
library(stringr)
str_extract(v1, "[A-Z].*")
[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner"
[6] "Miami (FL)" "Appalachian State" "Arkansas State" "Army" "Clemson"
[11] "Georgia Southern"
This extracts everything starting from the first upper case letter (thereby ignoring the unwanted rankings).
Upvotes: 0
Reputation: 887971
We can match the opening ( followed by one or more digits (\\d+), then the closing ) and one or more spaces (\\s+), and replace the match with an empty string (""):
sub("\\(\\d+\\)\\s+", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
Using the OP's updated example (v1 is the vector of names):
sub("\\(\\d+\\)\\s+", "", v1)
#[1] "North Texas" "Southern Methodist" "Texas-El Paso" "Brigham Young" "Winner" "Miami (FL)"
#[7] "Appalachian State" "Arkansas State" "Army" "Clemson" "Georgia Southern"
Or another option with str_remove from stringr:
library(stringr)
str_remove("(13) Miami (FL)", "\\(\\d+\\)\\s+")
Upvotes: 1
Reputation: 389325
You can use sub to remove a number in brackets followed by a whitespace character:
sub("\\(\\d+\\)\\s", "", "(13) Miami (FL)")
#[1] "Miami (FL)"
The regex could be made stricter based on the pattern in data.
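As one possible way to tighten it, and assuming the ranking only ever appears at the very start of the name (which holds for the sample data in the question, but is an assumption about the full column), the pattern can be anchored with ^. The variable names teams and cleaned below are just illustrative:

```r
# A subset of the sample names from the question
teams <- c("North Texas", "(12) Miami (FL)", "(1) Clemson", "Army")

# ^ anchors the match to the start of the string, so only a leading
# "(digits)" group and the whitespace after it are removed;
# a trailing "(FL)" is never touched because it is not at the start
cleaned <- sub("^\\(\\d+\\)\\s+", "", teams)
print(cleaned)
#[1] "North Texas" "Miami (FL)"  "Clemson"     "Army"
```

Names without a ranking prefix pass through unchanged, since sub returns the input as-is when the pattern does not match.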
Upvotes: 1