hpesoj626

Reputation: 3629

Extract the last two strings of words separated by the last comma

Consider the data frame df, which contains the column location.

df <- structure(
  list(location = c("International Society for Horticultural Science (ISHS), Leuven, Belgium",
                    "International Society for Horticultural Science (ISHS), Leuven, Belgium",
                    "White House, Jodhpur, India", "Crop Science Society of the Philippines, College, Philippines",
                    "Crop Science Society of the Philippines, College, Philippines",
                    "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic")), 
  .Names = "location", 
  row.names = c(NA, -6L), 
  class = "data.frame")

I am trying to extract the address from location. The address should consist of the last two comma-separated fields, that is, everything after the second-to-last comma. How do I do this? I have been trying to learn regular expressions, but my knowledge is not up to par. Here is what I tried:

library(tidyverse)
df %>% mutate(address = str_extract(location, "[:alpha:]+$")) %>% select(address)

This outputs

#       address
# 1     Belgium
# 2     Belgium
# 3       India
# 4 Philippines
# 5 Philippines
# 6    Republic

Here is my desired output:

#                   address
# 1         Leuven, Belgium
# 2         Leuven, Belgium
# 3          Jodhpur, India
# 4    College, Philippines
# 5    College, Philippines
# 6 Kangwon, Korea Republic

Upvotes: 1

Views: 838

Answers (3)

Ronak Shah

Reputation: 389335

Like yours, my knowledge of regex is not up to par. After trying several ways to do it with regex, I gave up and used the traditional approach.

sapply(strsplit(df$location, ","), function(x) paste0(tail(x, 2), collapse = ","))

#[1] " Leuven, Belgium"         " Leuven, Belgium"        
#[3] " Jodhpur, India"          " College, Philippines"   
#[5] " College, Philippines"    " Kangwon, Korea Republic"

Here we split location on ",", select the last two pieces with tail, and paste them back together with "," to get the output above.
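The results above keep the space that followed each comma, which the desired output does not have. A minimal base R sketch of one way to avoid it, assuming the fields are always separated by a comma plus optional whitespace, is to split on that whole separator and rejoin with ", ". The helper name last_two_fields is hypothetical, and the vector is a small sample of df$location:

```r
# Hypothetical helper: keep the last `n` comma-separated fields of each string.
last_two_fields <- function(x, n = 2) {
  # Split on a comma plus any following whitespace,
  # so the pieces carry no leading spaces.
  parts <- strsplit(x, ",\\s*")
  # vapply() guarantees exactly one character value per input element.
  vapply(parts, function(p) paste(tail(p, n), collapse = ", "), character(1))
}

location <- c(
  "White House, Jodhpur, India",
  "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic"
)

last_two_fields(location)
# [1] "Jodhpur, India"          "Kangwon, Korea Republic"
```

A string with fewer than two fields is returned as is, because tail() simply returns whatever pieces exist.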


I finally found some time to get the regex approach working.

library(stringi)
stri_extract(df$location, regex = "[^,]+,[^,]+$")

#[1] " Leuven, Belgium"         " Leuven, Belgium"        
#[3] " Jodhpur, India"          " College, Philippines"   
#[5] " College, Philippines"    " Kangwon, Korea Republic"

Upvotes: 1

Melissa Key

Reputation: 4551

This works:

df %>%
  mutate(address = str_extract(location, "([[:alpha:]]+ ?)+, ([[:alpha:]]+ ?)+$"))

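The pattern [[:alpha:]]+ ? matches a string of letters, possibly followed by a space. Wrapping it in parentheses and following it with + matches that entire pattern one or more times, so each side of the comma can contain several words. The trailing $ anchors the match to the end of the string.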

Upvotes: 1

user9628338

[a-z A-Z]+, [a-z A-Z]+$

This might work
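One quick way to check the pattern is to run it on a couple of the sample strings. The sketch below uses base R's regexpr() and regmatches() rather than str_extract(), purely as an assumption to keep it dependency-free. Because the character class contains a space, the match begins at the space after the preceding comma, so the result needs trimming:

```r
location <- c(
  "White House, Jodhpur, India",
  "Institute of Forest Science, Kangwon National University, Kangwon, Korea Republic"
)

pattern <- "[a-z A-Z]+, [a-z A-Z]+$"

# regexpr() locates the first match in each string; regmatches() extracts it.
# Note that regmatches() silently drops strings that do not match at all.
matched <- regmatches(location, regexpr(pattern, location))
matched
# [1] " Jodhpur, India"          " Kangwon, Korea Republic"

# Remove the leading space picked up by the character class.
trimws(matched)
# [1] "Jodhpur, India"          "Kangwon, Korea Republic"
```

The pattern only covers ASCII letters and spaces, so a place name containing a hyphen, period, or accented letter would be cut short.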

Upvotes: 1
