paulina dufter

Reputation: 1

Based on vector of strings extract string from data.table column into new column

In a data.table I have a column of company names that sometimes include the company's city. Given a vector of all existing cities, I would like to detect whether a city name is part of the company name and, if so, extract the city into a new column. I currently use a for loop in R that goes through every row of my data.table and checks each city in my vector of cities, which takes a very long time. Is there a way to vectorize this operation to make it more computationally efficient?

Current data:

Company_name             Location
Company 1 Berlin Gmbh.   NA
Dresden Company 2 Gmbh.  NA
Company 3 in Hamburg     NA
Company 4 Ldt            NA

Desired output:

Company_name             Location
Company 1 Berlin Gmbh.   Berlin
Dresden Company 2 Gmbh.  Dresden
Company 3 in Hamburg     Hamburg
Company 4 Ldt            NA

Upvotes: 0

Views: 435

Answers (1)

langtang

Reputation: 24722

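# Vectorized: combine all cities into one regex ("Berlin|Dresden|Hamburg")
# and extract the first match per row; rows without a match get NA
df[, city := stringr::str_extract(Company, paste0(cities, collapse = "|"))]

OR

# Row-wise alternative: test every city against each company name.
# Requires R >= 4.1 for the \(x) lambda syntax and is intended for rows
# that match at most one city
df[, city := cities[sapply(cities, \(x) grepl(x, Company))], by = 1:nrow(df)]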

Output:

                   Company    city
1:  Company 1 Berlin Gmbh.  Berlin
2: Dresden Company 2 Gmbh. Dresden
3:    Company 3 in Hamburg Hamburg
4:           Company 4 Ldt    <NA>

Input:

library(data.table)
df = data.table(
  Company = c(
    "Company 1 Berlin Gmbh.",
    "Dresden Company 2 Gmbh.",
    "Company 3 in Hamburg",
    "Company 4 Ldt"
  )
)
cities = c('Berlin', 'Dresden', 'Hamburg')
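
If the real city list contains regex metacharacters (for example a dot in an abbreviated name) or names that are prefixes of other names, a plain "|"-joined pattern may match more or less than intended. Below is a minimal base R sketch of one way to guard against that, assuming the same `df` and `cities` as above. The `\Q...\E` quoting, the `\b` word boundaries, and the longest-first ordering are additions for illustration and are not part of the original answer.

```r
library(data.table)

df <- data.table(
  Company = c("Company 1 Berlin Gmbh.", "Dresden Company 2 Gmbh.",
              "Company 3 in Hamburg", "Company 4 Ldt")
)
cities <- c("Berlin", "Dresden", "Hamburg")

# Try longer names first so a short name cannot shadow a longer one
ordered <- cities[order(-nchar(cities))]

# \Q...\E quotes each name literally (PCRE); \b restricts matches to whole words
pattern <- paste0("\\b(?:", paste0("\\Q", ordered, "\\E", collapse = "|"), ")\\b")

# One vectorized regex pass over the whole column
m <- regexpr(pattern, df$Company, perl = TRUE)

# regmatches() returns only the matched strings, so fill them in by position
city <- rep(NA_character_, nrow(df))
city[which(m != -1L)] <- regmatches(df$Company, m)
df[, city := city]
```

The column is still scanned once rather than once per city, so this should scale similarly to the `str_extract` version.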

Upvotes: 2
