Reputation: 4008
I have this sample data frame:
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254", "114 S PUEBLO ST GILBERT 85233", "16981 W YOUNG ST SURPRISE 85388")
person <- c("Maria", "Jose", "Adan", "Eva")
my_address <- tibble::tibble(person, address)
I need to extract the city from the address column. The city can consist of 1 or 2 words, but it always comes right before the ZIP code, which consists of 5 digits.
From the data frame, I would like to get "EL MIRAGE", "SCOTTSDALE", "GILBERT" and "SURPRISE" in a new column: city
Important:
The cities always come after a 2 or 3 letter word like ST, AVE or RD.
For example, from "16981 W YOUNG ST SURPRISE 85388" I'd like to get SURPRISE, which comes after "ST".
So, I was trying this regex:
my_address$city <-gsub("(.*)([a-zA-Z])([0-9]{5})(.*)", "\\2", my_address$address)
But it returns all the text in the column, not the desired cities. Also, I notice that I didn't instruct it to check for 1 or 2 words before the 5 digits, so would it extract only 1 word?
UPDATE 1:
string1 <- "114 S PUEBLO ST GILBERT 85233"
sapply(stringr::str_extract_all(string1,"\\w{4,}"),"[",3)
returns: 85233, when GILBERT was expected.
Upvotes: 2
Views: 106
Reputation: 13309
I normally prefer one-liners, and this approach is admittedly more complicated than I would like: it needs an extra step to remove "ST" before "SURPRISE". That step assumes the leftover prefix is always "ST".
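library(stringr)
# Candidate pairs: a word of 2+ characters followed by a word of 3+ characters
new_s <- unlist(str_extract_all(my_address$address, "\\w{2,} \\w{3,}"))
# Blank out pairs that start with 3+ word characters and end in a non-digit (e.g. "CPT DREYFUS")
newer_s <- str_remove_all(new_s, "^\\w{3}.*\\D$")
# Strip the trailing ZIP code (e.g. "SCOTTSDALE 85254" becomes "SCOTTSDALE")
newer_s <- str_remove_all(newer_s, "\\s.*\\d")
# Drop a leading "ST "
res <- str_remove_all(newer_s, "^ST ")
# Blanked-out pairs become NA and are filtered out
res[res == ""] <- NA
my_address$city <- res[complete.cases(res)]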
Result:
my_address
# A tibble: 4 x 3
#   person address                             city
#   <chr>  <chr>                               <chr>
# 1 Maria  11537 W LARKSPUR RD EL MIRAGE 85335 EL MIRAGE
# 2 Jose   6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
# 3 Peter  16981 W YOUNG ST SURPRISE 85388     SURPRISE
# 4 Paul   114 S PUEBLO ST GILBERT 85233       GILBERT
Data:
address <- c("11537 W LARKSPUR RD EL MIRAGE 85335", "6702 E CPT DREYFUS SCOTTSDALE 85254",
"16981 W YOUNG ST SURPRISE 85388","114 S PUEBLO ST GILBERT 85233")
person <- c("Maria", "Jose","Peter","Paul")
my_address <- tibble::tibble(person, address)
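To see why the cleanup steps are needed, it may help to look at what the first pattern returns per address before unlist() flattens it. This is only a small sketch on the same sample data; the names pairs and n_pairs are mine and not part of the code above.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "16981 W YOUNG ST SURPRISE 85388",
             "114 S PUEBLO ST GILBERT 85233")

# Keep one list element per address instead of calling unlist()
pairs <- str_extract_all(address, "\\w{2,} \\w{3,}")

# Number of candidate pairs found in each address
n_pairs <- lengths(pairs)
n_pairs
# 1 2 1 1

# The second address yields two candidates; only the second one holds the city
pairs[[2]]
# "CPT DREYFUS"      "SCOTTSDALE 85254"
```

Because the final assignment depends on exactly one candidate per address surviving the cleanup, an address that produces zero or several surviving candidates would shift the cities out of line with their rows, so this approach is probably best limited to data that looks like the sample.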
Upvotes: 2
Reputation: 27732
This dplyr + stringr (tidyverse) solution relies on the fact that you know which 2-3 letter words precede a city...
# vector of the 2-3 letter words that can precede a city
v.before <- c("ST", "RD", "AVE")
# with this vector, we can build an 'or' pattern for the regex
library(dplyr)
library(stringr)
data.frame(person, address) %>%
  # extract whatever lies between one of those words and the 5-digit ZIP code
  mutate(place = str_extract(address, paste0("(?<=", paste0(v.before, collapse = " |"), " ).*(?= [0-9]{5})"))) %>%
  # no match found? then the city is the second-to-last word of the address
  mutate(place = ifelse(is.na(place), word(address, -2), place))
#   person                             address      place
# 1  Maria 11537 W LARKSPUR RD EL MIRAGE 85335  EL MIRAGE
# 2   Jose 6702 E CPT DREYFUS SCOTTSDALE 85254 SCOTTSDALE
# 3   Adan       114 S PUEBLO ST GILBERT 85233    GILBERT
# 4    Eva     16981 W YOUNG ST SURPRISE 85388   SURPRISE
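One caveat: the lookbehind only checks that the text ends in "ST ", "RD " or "AVE ", so it would also fire after a longer word such as "WEST ". A possible variant, sketched here under the assumption that the suffix should be a whole word, uses a word boundary and a capture group instead. The names pattern and place, and the "WEST RD" address, are made up for illustration and are not from the question.

```r
library(stringr)

address <- c("11537 W LARKSPUR RD EL MIRAGE 85335",
             "6702 E CPT DREYFUS SCOTTSDALE 85254",
             "114 S PUEBLO ST GILBERT 85233",
             "16981 W YOUNG ST SURPRISE 85388")

# Same vector of words that can precede a city
v.before <- c("ST", "RD", "AVE")

# \\b forces the suffix to be a whole word; the city is capture group 1
pattern <- paste0("\\b(?:", paste(v.before, collapse = "|"), ") (.+) [0-9]{5}$")

# str_match() returns a matrix: column 1 is the full match, column 2 is group 1
place <- str_match(address, pattern)[, 2]

# Same fallback: no suffix found, so take the second-to-last word
place <- ifelse(is.na(place), word(address, -2), place)
place
# "EL MIRAGE"  "SCOTTSDALE" "GILBERT"    "SURPRISE"

# Hypothetical address where a street name ends in "ST"
tricky <- str_match("12 N WEST RD PHOENIX 85001", pattern)[, 2]
tricky
# "PHOENIX"
```

The fallback still assumes a one-word city whenever no known suffix is present, so the vector of suffixes would need to cover the real data for two-word cities to come out right.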
Upvotes: 2