SNT
SNT

Reputation: 1393

Search in a column from a vector of strings using R

I have column with address.I want to parse it and just have the state name.Below is my column

structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company", 
"10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub", 
"10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545", 
"11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133", 
"1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733", 
"1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007", 
"830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
)), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")

I have another vector from there I want to compare it and replace .

v<- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")

I did try using match and grep but it returned with NA's.

Upvotes: 0

Views: 63

Answers (4)

Andre Elrico
Andre Elrico

Reputation: 11480

A baseR solution using regmatches, gregexpr

data:

df1 <-
structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company", 
                               "10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub", 
                               "10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545", 
                                                                            "11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133", 
                                                                            "1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733", 
                                                                            "1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007", 
                                                                            "830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870"
                               )), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")

v     <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")

code:

v_mod <- paste0(v, collapse="|")

df1$states <- sapply(regmatches(df1$BreweryAddress, gregexpr(v_mod, df1$BreweryAddress)), function(x){if(length(x)==0) NA else x})

result:

#                        BreweryName                                                              BreweryAddress       states
#4             (512) Brewing Company    407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545        Texas
#6            0 Mile Brewing Company 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133 Pennsylvania
#8                 10 Barrel Brewing                          1501 E StSan Diego, California, 92101United States         <NA>
#10 10 Barrel Brewing - Eastside Pub              62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733       Oregon
#12 10 Barrel Brewing - Portland Pub  1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007       Oregon
#14            10 Barrel Brewing Co.         830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870        Idaho

Upvotes: 0

Hunaidkhan
Hunaidkhan

Reputation: 1418

library stringr has easy approach for this

v<- c("Texas","Pennsylvania","Oregon","Oregon","Oregon","Idaho")

library(stringr)
demographics$State <- str_extract(demographics$BreweryAddress,fixed(v, ignore_case=TRUE)) ##i have saved your data as demographics data frame. 

Upvotes: 0

Calum You
Calum You

Reputation: 15062

Here is a tidyverse solution. We can basically concatenate the states into a single pattern with | as the separator, to indicate any of them can be options, and then extract from the address column. This is pretty rough (what if a brewery is at Idaho Ave?) but depending on your data this is probably enough.

library(tidyverse)
df <- structure(list(BreweryName = c("(512) Brewing Company", "0 Mile Brewing Company", "10 Barrel Brewing", "10 Barrel Brewing - Eastside Pub", "10 Barrel Brewing - Portland Pub", "10 Barrel Brewing Co."), BreweryAddress = c("407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545", "11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133", "1501 E StSan Diego, California, 92101United States", "62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733", "1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007", "830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870")), row.names = c(4L, 6L, 8L, 10L, 12L, 14L), class = "data.frame")
v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
df %>%
  mutate(State = str_extract(BreweryAddress, str_c(v, collapse = "|")))
#>                        BreweryName
#> 1            (512) Brewing Company
#> 2           0 Mile Brewing Company
#> 3                10 Barrel Brewing
#> 4 10 Barrel Brewing - Eastside Pub
#> 5 10 Barrel Brewing - Portland Pub
#> 6            10 Barrel Brewing Co.
#>                                                                BreweryAddress
#> 1    407 Radam LnSte F200Austin, Texas, 78745-1197United States(512) 921-1545
#> 2 11 W 2nd StHummelstown, Pennsylvania, 17036-1506United States(717) 319-0133
#> 3                          1501 E StSan Diego, California, 92101United States
#> 4              62950 NE 18th StBend, Oregon, 97701United States(541) 241-7733
#> 5  1411 NW Flanders StPortland, Oregon, 97209-2620United States(541) 585-1007
#> 6         830 W Bannock StBoise, Idaho, 83702-5857United States(208) 344-5870
#>          State
#> 1        Texas
#> 2 Pennsylvania
#> 3         <NA>
#> 4       Oregon
#> 5       Oregon
#> 6        Idaho

Created on 2018-09-25 by the reprex package (v0.2.0).

Upvotes: 1

Tim Biegeleisen
Tim Biegeleisen

Reputation: 520908

Here is a base R option using grepl:

v <- c("Texas","Pennsylvania","Oregon","Oregon","Idaho")
states <- paste0("\\b", v, "\\b", collapse="|")
states

[1] "\\bTexas\\b|\\bPennsylvania\\b|\\bOregon\\b|\\bOregon\\b|\\bIdaho\\b"

df[grepl(states, df$BreweryAddress), ]

I printed out states so that it is clear what regex pattern we are using to search the brewery addresses. We are using an alternation of each state name, enclosed in word boundary markers. This ensures that we don't accidentally match a string which happened to contain some state name as a substring.

Upvotes: 2

Related Questions