How to extract 1 or 2 words before n digits?

Question

I have this sample data frame:

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254", "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")
person <- c("Maria", "Jose", "Adan", "Eva")

library(tibble)
my_address <- tibble(person, address)

I need to extract the city from the address column. The city can consist of 1 or 2 words, but it always comes right before the ZIP code, which consists of 5 digits.

From the data frame, I would like to get "EL MIRAGE", "SCOTTSDALE", "GILBERT" and "SURPRISE" in a new column: city

Important:

The city always comes after a 2 or 3 letter word such as ST, AVE or RD.

For example, from "16981 W YOUNG ST SURPRISE 85388" I'd like to get SURPRISE, which comes after "ST".

So, I was trying this regex:

my_address$city <- gsub("(.*)([a-zA-Z])([0-9]{5})(.*)", "\\2", my_address$address)

But it returns all the text in the column, not the desired cities. Also, I noticed that I didn't tell it to check for 1 or 2 words before the 5 digits, so would it extract only 1 word?

UPDATE 1:

string1 <- "114 S PUEBLO ST GILBERT 85233"
sapply(stringr::str_extract_all(string1, "\\w{4,}"), "[", 3)

returns: 85233, when GILBERT was expected.

Wimpel · Accepted Answer

This dplyr + stringr (tidyverse) solution relies on the fact that you know which 2-3 letter words precede a city...

# vector with the 2-3 letter words that can precede a city
v.before <- c("ST", "RD", "AVE")
# from this vector we can build an 'or' pattern for the regex: (?<=ST |RD |AVE )

library( dplyr )
library( stringr )
data.frame( person, address) %>% 
  mutate( place = stringr::str_extract( address, paste0("(?<=", paste0(v.before, collapse = " |" ), " ).*(?= [0-9]{5})") ) ) %>%
  # no match found? then take the second-to-last word of the address as the city
  mutate( place = ifelse( is.na( place ), stringr::word(address, -2), place))

#   person                             address      place
# 1  Maria 11537 W LARKSPUR RD EL MIRAGE 85335  EL MIRAGE
# 2   Jose 6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
# 3   Adan       114 S PUEBLO ST GILBERT 85233    GILBERT
# 4    Eva     16981 W YOUNG ST SURPRISE 85388   SURPRISE
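
One caveat with the lookbehind above: it has no word boundary, so it can also match the "ST " that ends a longer word such as WEST. Below is a minimal sketch of a variant that anchors the street suffix with \\b and pulls the city out of a capture group via str_match(). The fifth address is made up for illustration, and pattern_wb is just a name chosen here; treat this as one possible adjustment rather than a definitive fix.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "114 S PUEBLO ST GILBERT 85233",
             "16981 W YOUNG ST SURPRISE 85388",
             "123 W WEST ST MESA 85201")  # hypothetical address, not from the question

v.before <- c("ST", "RD", "AVE")

# \\b forces the suffix to start at a word boundary, so the "ST" inside "WEST" is skipped;
# (.+) captures everything between the suffix and the trailing 5-digit ZIP code
pattern_wb <- paste0("\\b(?:", paste0(v.before, collapse = "|"), ") (.+) [0-9]{5}$")

# str_match() returns a matrix; column 2 is the first capture group (NA when nothing matches)
city <- str_match(address, pattern_wb)[, 2]

# same fallback as in the answer: second-to-last word when no suffix was found
city <- ifelse(is.na(city), word(address, -2), city)
city
```

This still assumes the first listed suffix in the address is the one right before the city, so a street name that itself starts with ST, RD or AVE would need extra handling.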
