Reputation: 1393
I have column with address.I want to parse it and just have the state name.Below is my column
structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company",
"10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub",
"10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545",
"11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133",
"1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733",
"1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007",
"830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
I have another vector from there I want to compare it and replace .
v<- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
I did try using match
and grep
but it returned with NA's
.
Upvotes: 0
Views: 63
Reputation: 11480
A baseR solution using regmatches, gregexpr
data:
df1 <-
structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company",
"10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub",
"10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545",
"11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133",
"1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733",
"1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007",
"830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
code:
v_mod <- paste0(v, collapse="|")
df1$states <- sapply(regmatches(df1$BreweryAddress, gregexpr(v_mod, df1$BreweryAddress)), function(x){if(length(x)==0) NA else x})
result:
# BreweryName BreweryAddress states
#4 (512) Brewing Company 407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545 Texas
#6 0 Mile Brewing Company 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133 Pennsylvania
#8 10 Barrel Brewing 1501 E StSan Diego, California, 92101United States <NA>
#10 10 Barrel Brewing - Eastside Pub 62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733 Oregon
#12 10 Barrel Brewing - Portland Pub 1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007 Oregon
#14 10 Barrel Brewing Co. 830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870 Idaho
Upvotes: 0
Reputation: 1418
library stringr has easy approach for this
v<- c("Texas","Pennsylvania","Oregon","Oregon","Oregon","Idaho")
library(stringr)
demographics$State <- str_extract(demographics$BreweryAddress,fixed(v, ignore_case=TRUE)) ##i have saved your data as demographics data frame.
Upvotes: 0
Reputation: 15062
Here is a tidyverse
solution. We can basically concatenate the states into a single pattern with |
as the separator, to indicate any of them can be options, and then extract from the address column. This is pretty rough (what if a brewery is at Idaho Ave?) but depending on your data this is probably enough.
library(tidyverse)
df <- structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company", "10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub", "10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545", "11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133", "1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733", "1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007", "830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870")), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
df %>%
mutate(State = str_extract(BreweryAddress, str_c(v, collapse = "|")))
#> BreweryName
#> 1 (512) Brewing Company
#> 2 0 Mile Brewing Company
#> 3 10 Barrel Brewing
#> 4 10 Barrel Brewing - Eastside Pub
#> 5 10 Barrel Brewing - Portland Pub
#> 6 10 Barrel Brewing Co.
#> BreweryAddress
#> 1 407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545
#> 2 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133
#> 3 1501 E StSan Diego, California, 92101United States
#> 4 62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733
#> 5 1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007
#> 6 830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870
#> State
#> 1 Texas
#> 2 Pennsylvania
#> 3 <NA>
#> 4 Oregon
#> 5 Oregon
#> 6 Idaho
Created on 2018-09-25 by the reprex package (v0.2.0).
Upvotes: 1
Reputation: 520908
Here is a base R option using grepl
:
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
states <- paste0("\\b", v, "\\b", collapse="|")
states
[1] "\\bTexas\\b|\\bPennsylvania\\b|\\bOregon\\b|\\bOregon\\b|\\bIdaho\\b"
df[grepl(states, df$BreweryAddress), ]
I printed out states
so that it is clear what regex pattern we are using to search the brewery addresses. We are using an alternation of each state name, enclosed in word boundary markers. This ensures that we don't accidentally match a string which happened to contain some state name as a substring.
Upvotes: 2