Fred

Reputation: 35

Search and replace multiple strings in list of strings: improve R code

I am looking for a simpler solution to the following problem in R. I have a list of names separated by commas, but some of the names themselves contain commas. To separate the names, I would like to first replace every name that contains a comma with a comma-free version and then split by comma. My problem is that I have around 26,000 strings with several names in each, and a list of around 130 names that contain commas. I have written a nested foreach loop (to use multiple cores and speed things up); it works, but it is horribly slow. Is there a quicker way to search the strings and replace the relevant names? Here is my sample code:

List_of_names<-as.data.frame(c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike","Digital, Mike, John, Sr","Svenja, Sven"))
Comma_names<-as.data.frame(c("Franz, Jr.","Nice, LLC","John, Sr"))
colnames(Comma_names)<-"name"
Comma_names$replace_names<-gsub(",", "",Comma_names[,"name"])

library(doParallel)
library(foreach)
cl<-makeCluster(4) # Create cluster with desired number of cores
registerDoParallel(cl) # Register cluster


names_new<-foreach (i=1:nrow(List_of_names),.errorhandling="pass",.packages=c("foreach")) %dopar% {
  name_2<-List_of_names[i,]
  foreach (j=1:nrow(Comma_names),.combine=rbind,.errorhandling="pass") %do% {
    if(length(grep(Comma_names[j,1],name_2))>0){
      name_2<-gsub(Comma_names[j,1], Comma_names[j,2],name_2)
    }
  }
  name_2
}

In addition, the result of the foreach loop is a list, and if I try to save that list or use it to replace the column in my original data frame, it takes forever. How can I change my code to make it faster?

Thank you to everyone who reads this and is able to help!

Upvotes: 0

Views: 621

Answers (1)

thothal

Reputation: 20329

Principle

You can use a combination of Reduce and stri_replace_all from the stringi package.

Code

library(stringi)
Comma_names <- structure(list(name = c("Franz, Jr.", "Nice, LLC", "John, Sr"), 
                              replace_names = c("Franz Jr.", "Nice LLC", "John Sr")), 
                              .Names = c("name", "replace_names"), 
                              row.names = c(NA, -3L), class = "data.frame")


List_of_names <- structure(list(name = c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
                                         "Digital, Mike, John, Sr", "Svenja, Sven")), 
                                .Names = "name", 
                                row.names = c(NA, -3L), class = "data.frame")

wrapper <- function(str, ind) stri_replace_all(str, Comma_names$replace_names[ind], 
                                               fixed = Comma_names$name[ind])

ind <- 1:NROW(Comma_names)
Reduce(wrapper, ind, init = List_of_names$name)
# [1] "Fred, Heiko, Franz Jr., Nice LLC, Meike"
# [2] "Digital, Mike, John Sr"                 
# [3] "Svenja, Sven" 
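
Once the comma-containing names are cleaned up, the split step from the question becomes straightforward. The following is only a minimal sketch in base R: the vector cleaned is typed in by hand from the output above, and the names split_names and long_names are made up for illustration.

```r
# Cleaned strings, copied from the output of the Reduce() call above
cleaned <- c("Fred, Heiko, Franz Jr., Nice LLC, Meike",
             "Digital, Mike, John Sr",
             "Svenja, Sven")

# Split on a comma followed by optional whitespace; returns a list with
# one character vector per input string
split_names <- strsplit(cleaned, ",\\s*")

# Optional: flatten to a long data frame that remembers the source row
long_names <- data.frame(
  row  = rep(seq_along(split_names), lengths(split_names)),
  name = unlist(split_names),
  stringsAsFactors = FALSE
)
```

Keeping the row index in long_names makes it possible to join the individual names back to the original data frame later, which may be cheaper than storing a list column.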

Explanation

stri_replace_all is a fast function that replaces all occurrences of a pattern in a string. Reduce applies a function to the result of the previous function call. So we first apply wrapper to the column with all the names and replace the string from the first row of Comma_names. The result is fed to wrapper again, this time to replace all occurrences of the string from the second row, and so on. This code should run reasonably fast, so you do not need to parallelize. I would be curious to hear your feedback on the execution time.
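
To make the folding explicit, here is a rough sketch of what the Reduce call amounts to when written as a plain loop. It also shows stri_replace_all_fixed with vectorize_all = FALSE, which, as far as I know, applies every pattern to every string in sequence and so should give the same result in a single call. The vectors x, patterns and replacements are illustrative stand-ins for List_of_names$name, Comma_names$name and Comma_names$replace_names.

```r
library(stringi)

# Illustrative stand-ins for the columns used in the answer
x <- c("Fred, Heiko, Franz, Jr., Nice, LLC, Meike",
       "Digital, Mike, John, Sr",
       "Svenja, Sven")
patterns     <- c("Franz, Jr.", "Nice, LLC", "John, Sr")
replacements <- c("Franz Jr.", "Nice LLC", "John Sr")

# The Reduce() call unrolled: each pass works on the result of the last one
res_loop <- x
for (i in seq_along(patterns)) {
  res_loop <- stri_replace_all_fixed(res_loop, patterns[i], replacements[i])
}

# Same idea in one call: with vectorize_all = FALSE each pattern is
# replaced in each string in turn, instead of pairing pattern i with string i
res_vec <- stri_replace_all_fixed(x, patterns, replacements,
                                  vectorize_all = FALSE)

identical(res_loop, res_vec)
```

Both variants use fixed (literal) matching, so the "." in "Franz, Jr." is treated as a plain dot and not as a regex wildcard.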

Benchmark

Just a little benchmark with 3 million lines:

List_of_names <- List_of_names[rep(1:NROW(List_of_names), 1e6), , drop = FALSE]
system.time(invisible(Reduce(wrapper, ind, init = List_of_names$name)))
# user  system elapsed 
# 1.95    0.00    1.96

Upvotes: 2
