Reputation: 3629
Consider the data frame df, which contains the column location.
df <- structure(
list(location = c("International Society for Horticultural Science (ISHS), Leuven, Belgium",
"International Society for Horticultural Science (ISHS), Leuven, Belgium",
"White House, Jodhpur, India", "Crop Science Society of the Philippines, College, Philippines",
"Crop Science Society of the Philippines, College, Philippines",
"Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic")),
.Names = "location",
row.names = c(NA, -6L),
class = "data.frame")
I am trying to extract the address from location. The address should consist of the last two comma-separated parts of each string, that is, the words on either side of the last comma. How do I do this? I have been trying to learn regular expressions, but my knowledge is not up to par. Here is what I tried:
library(tidyverse)
df %>% mutate(address = str_extract(location, "[:alpha:]+$")) %>% select(address)
This outputs
# address
# 1 Belgium
# 2 Belgium
# 3 India
# 4 Philippines
# 5 Philippines
# 6 Republic
Here is my desired output:
# address
# 1 Leuven, Belgium
# 2 Leuven, Belgium
# 3 Jodhpur, India
# 4 College, Philippines
# 5 College, Philippines
# 6 Kangwon, Korea Republic
Upvotes: 1
Views: 838
Reputation: 389335
Like you, my knowledge of regex is not up to par either. After trying several ways to do it with regex, I gave up and used the traditional approach.
sapply(strsplit(df$location, ","), function(x) paste0(tail(x, 2), collapse = ","))
#[1] " Leuven, Belgium" " Leuven, Belgium"
#[3] " Jodhpur, India" " College, Philippines"
#[5] " College, Philippines" " Kangwon, Korea Republic"
Here we split location on ",", select the last two elements of each piece with tail, and paste them back together with "," to get the required output.
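Note that the output above keeps the space that follows each comma, so every result starts with a blank. A minimal sketch of the same split/tail/paste idea that trims each piece first is shown here; last_fields is a hypothetical helper name, not something from the original answer, and the input is a two-row subset of the question's data.

```r
# Two of the strings from the question's df$location
location <- c(
  "White House, Jodhpur, India",
  "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic"
)

# Hypothetical helper: keep the last n comma-separated fields of each string
last_fields <- function(x, n = 2) {
  parts <- strsplit(x, ",", fixed = TRUE)  # split on literal commas
  vapply(
    parts,
    # trim surrounding whitespace from each kept field, then rejoin with ", "
    function(p) paste(trimws(tail(p, n)), collapse = ", "),
    character(1)
  )
}

address <- last_fields(location)
address
# values: "Jodhpur, India" and "Kangwon, Korea Republic"
```

vapply is used instead of sapply only so that the result is guaranteed to be a character vector; the logic is otherwise the same as in the one-liner above.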
I finally found some time to get the regex version working:
library(stringi)
stri_extract(df$location, regex = "[^,]+,[^,]+$")
#[1] " Leuven, Belgium" " Leuven, Belgium"
#[3] " Jodhpur, India" " College, Philippines"
#[5] " College, Philippines" " Kangwon, Korea Republic"
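The pattern [^,]+,[^,]+$ reads as: a run of non-comma characters, one comma, another run of non-comma characters, then the end of the string. Because a space is also a non-comma character, the match starts with the blank after the second-to-last comma. A possible variation, sketched in base R under the assumption that the first matched character should not be whitespace:

```r
# Two of the strings from the question's df$location
location <- c(
  "International Society for Horticultural Science (ISHS), Leuven, Belgium",
  "Crop Science Society of the Philippines, College, Philippines"
)

# [^,\s]   first character: neither a comma nor whitespace
# [^,]*    rest of the second-to-last field
# ,        the last comma
# [^,]+$   the final field, up to the end of the string
pattern <- "[^,\\s][^,]*,[^,]+$"

# perl = TRUE so that \s is understood inside the bracket expression
address <- regmatches(location, regexpr(pattern, location, perl = TRUE))
address
# values: "Leuven, Belgium" and "College, Philippines"
```

One caveat with this sketch: regmatches silently drops elements that do not match, so for input that may contain strings without a comma the result can be shorter than the input.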
Upvotes: 1
Reputation: 4551
This works:
df %>%
mutate(address = str_extract(location, "([[:alpha:]]+ ?)+, ([[:alpha:]]+ ?)+$"))
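The pattern [[:alpha:]]+ ? matches a run of letters, optionally followed by a single space. Wrapping it in parentheses and appending + matches that whole pattern one or more times, so each side of the ", " can consist of several words. The trailing $ anchors the match to the end of the string.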
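To see what the building block accepts, here is a small base R sketch. The anchored test pattern word_run and the sample strings are illustrative additions, not part of the original answer.

```r
# The repeated group on its own, anchored so it must cover the whole string
word_run <- "^([[:alpha:]]+ ?)+$"

samples <- c("Korea Republic", "Kangwon", "Korea  Republic", "(ISHS)")
checks <- grepl(word_run, samples)
checks
# TRUE for the first two; FALSE for the double space and for the parentheses

# The full pattern from the answer, applied to one string from the question
pattern <- "([[:alpha:]]+ ?)+, ([[:alpha:]]+ ?)+$"
x <- "White House, Jodhpur, India"
address <- regmatches(x, regexpr(pattern, x))
address
# value: "Jodhpur, India"
```

Since the group only allows letters and single spaces, this approach relies on the last two fields containing nothing else; a field with digits, hyphens, or parentheses would not be matched.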
Upvotes: 1