Omar Gonzales

Reputation: 4008

How to extract 1 or 2 words before n digits?

I have this sample data frame:

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254", "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")
person <- c("Maria", "Jose", "Adan", "Eva")

my_address <- tibble::tibble(person, address)

I need to extract the city from the address column. The city can consist of 1 or 2 words, but it always comes right before the ZIP code, which consists of 5 digits.

From the data frame, I would like to get "EL MIRAGE", "SCOTTSDALE", "GILBERT" and "SURPRISE" in a new column: city

Important:

The city always comes after a 2- or 3-letter word such as ST, AVE or RD.

For example, from "16981 W YOUNG ST SURPRISE 85388" I'd like to get SURPRISE, which comes after "ST".

So, I was trying this regex:

my_address$city <-gsub("(.*)([a-zA-Z])([0-9]{5})(.*)", "\\2", my_address$address)

But it returns all the text in the column, not the desired cities. Also, I noticed that I didn't instruct it to check for 1 or 2 words before the 5 digits, so would it extract only 1 word?
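As a hedged diagnostic sketch (base R only; the variable names `original`, `with_space`, `has_match`, `unchanged` and `one_letter` are made up for illustration), the behaviour described above seems to come from two things: the pattern asks for a letter directly followed by five digits, while every address has a space before the ZIP code, and `([a-zA-Z])` captures a single letter rather than a word.

```r
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "114 S PUEBLO ST GILBERT 85233")

# The pattern from the question: a letter immediately followed by five digits.
original <- "(.*)([a-zA-Z])([0-9]{5})(.*)"

# No address contains a letter glued to the ZIP code, so nothing matches ...
has_match <- grepl(original, address)

# ... and gsub() returns its input unchanged when the pattern does not match.
unchanged <- gsub(original, "\\2", address)

# Allowing the space makes the pattern match, but group 2 is still
# a single letter: the last letter before the space and the ZIP code.
with_space <- "(.*)([a-zA-Z]) ([0-9]{5})(.*)"
one_letter <- gsub(with_space, "\\2", address)

print(has_match)
print(unchanged)
print(one_letter)
```

So even with the space accounted for, the capture group would need to describe one or two whole words rather than one character.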

UPDATE 1:

string1 <- "114 S PUEBLO ST GILBERT 85233"
sapply(stringr::str_extract_all(string1,"\\w{4,}"),"[",3)

returns: 85233, when GILBERT was expected.

Upvotes: 2

Views: 106

Answers (2)

NelsonGon

Reputation: 13309

I normally prefer one-liners, but this one ended up rather complicated and needs an extra step to remove "ST" before "SURPRISE". That step assumes the leftover prefix is always "ST".

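 library(stringr)
 # all "word of 2+ characters, space, word of 3+ characters" pairs
 new_s <- unlist(str_extract_all(my_address$address, "\\w{2,} \\w{3,}"))
 # blank out pairs that begin with three word characters and end in a non-digit
 newer_s <- str_remove_all(new_s, "^\\w{3}.*\\D$")
 # strip the trailing ZIP code
 newer_s <- str_remove_all(newer_s, "\\s.*\\d")
 # strip a leading "ST "
 res <- str_remove_all(newer_s, "^ST ")
 # drop the strings emptied above and keep the remaining cities
 res[res == ""] <- NA
 my_address$city <- res[complete.cases(res)]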

Result:

 my_address
# A tibble: 4 x 3
#  person address                             city      
#  <chr>  <chr>                               <chr>     
#1 Maria  11537 W LARKSPUR RD EL MIRAGE 85335 EL MIRAGE 
#2 Jose   6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
#3 Peter  16981 W YOUNG ST SURPRISE 85388     SURPRISE  
#4 Paul   114 S PUEBLO ST GILBERT 85233       GILBERT 

Data:

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "16981 W YOUNG ST SURPRISE 85388","114 S PUEBLO ST GILBERT 85233")
person <- c("Maria", "Jose","Peter","Paul")

my_address <- tibble::tibble(person, address)
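To make the chain of removals easier to follow, here is a sketch that keeps each intermediate result, using the same data and patterns as above. The names `step1` to `step4` are only illustrative; the point is that the first extraction returns five strings for four addresses, which is why the later cleanup steps are needed.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "16981 W YOUNG ST SURPRISE 85388",
             "114 S PUEBLO ST GILBERT 85233")

# Step 1: every "word of 2+ chars, space, word of 3+ chars" pair.
# The second address yields two pairs, so five strings come back.
step1 <- unlist(str_extract_all(address, "\\w{2,} \\w{3,}"))

# Step 2: blank out pairs that begin with three word characters
# and end in a non-digit (here only "CPT DREYFUS").
step2 <- str_remove_all(step1, "^\\w{3}.*\\D$")

# Step 3: strip the ZIP code that was captured together with SCOTTSDALE.
step3 <- str_remove_all(step2, "\\s.*\\d")

# Step 4: strip the leading "ST " left in front of SURPRISE and GILBERT.
step4 <- str_remove_all(step3, "^ST ")

print(step1)
print(step4)
```

The empty string left in `step4` is the one that gets turned into `NA` and dropped before assigning the city column. The approach depends on the exact shape of these four addresses, so it would likely need adjusting for other street suffixes.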

Upvotes: 2

Wimpel

Reputation: 27732

This dplyr + stringr (tidyverse) solution relies on knowing which 2-3 letter words precede a city...

# vector with the 2-3 letter words that can come before a city
v.before <- c("ST", "RD", "AVE")
# with this vector, we can build an 'or' pattern for the regex

library( dplyr )
library( stringr )
data.frame( person, address) %>% 
  mutate( place = stringr::str_extract( address, paste0("(?<=", paste0(v.before, collapse = " |" ), " ).*(?= [0-9]{5})") ) ) %>%
  # no match found? then take the second-to-last word of the address as the city
  mutate( place = ifelse( is.na( place ), stringr::word(address, -2), place))

#   person                             address      place
# 1  Maria 11537 W LARKSPUR RD EL MIRAGE 85335  EL MIRAGE
# 2   Jose 6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
# 3   Adan       114 S PUEBLO ST GILBERT 85233    GILBERT
# 4    Eva     16981 W YOUNG ST SURPRISE 85388   SURPRISE
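For readers who want to see what the nested `paste0()` calls actually produce, here is a small sketch that builds the pattern on its own and applies it outside of `mutate()`. The names `pattern`, `matched` and `city` are illustrative only; the data and the regex pieces are the ones from the question and the code above.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "114 S PUEBLO ST GILBERT 85233",
             "16981 W YOUNG ST SURPRISE 85388")

v.before <- c("ST", "RD", "AVE")

# Lookbehind for "ST ", "RD " or "AVE ", then anything,
# then a lookahead for a space and five digits.
pattern <- paste0("(?<=", paste0(v.before, collapse = " |"), " ).*(?= [0-9]{5})")
print(pattern)
# "(?<=ST |RD |AVE ).*(?= [0-9]{5})"

# The second address has none of the listed words, so it comes back as NA.
matched <- str_extract(address, pattern)

# Fallback: use the second-to-last word when the pattern found nothing.
city <- ifelse(is.na(matched), word(address, -2), matched)
print(city)
```

Note that the fallback only returns a single word, so a two-word city that is not preceded by one of the words in `v.before` would be cut short; extending `v.before` is the intended way to cover more cases.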

Upvotes: 2
