Ankit

Reputation: 359

Faster alternative methods to for-loop in R for pattern matching

I am working on a problem in which I have two data frames, data and Abbreviation, and I would like to replace every abbreviation that appears in data with its full form. Until now I have been using for-loops in the following manner:

# For every row, try each abbreviation in turn with a word-bounded pattern.
for (i in seq_along(data$text)) {
  for (j in seq_along(Abbreviation$abb)) {
    abb <- paste("(\\b", Abbreviation$abb[j], "\\b)", sep = "")
    data$text[i] <- gsub(abb, Abbreviation$Fullform[j], tolower(data$text[i]))
  }
}

The abbreviation data frame can be generated using the following code:

Abbreviation <- c(c("hru", "how are you"), 
                  c("asap", "as soon as possible"), 
                  c("bf", "boyfriend"), 
                  c("ur", "your"), 
                  c("u", "you"),
                  c("afk", "away from keyboard"))
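# stringsAsFactors = FALSE keeps both columns as character on R versions before 4.0.
Abbreviation <- data.frame(matrix(Abbreviation, ncol = 2, byrow = TRUE),
                           row.names = NULL, stringsAsFactors = FALSE)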

names(Abbreviation) <- c("abb","Fullform")

The data is simply a data frame with one column holding a text string in each row, which can be generated using the following code:

# One character column named "text"; stringsAsFactors = FALSE matters on R < 4.0.
data <- data.frame(text = c("its good to see you, hru doing?",
                            "I am near bridge come ASAP",
                            "Can u tell me the method u used for",
                            "afk so couldn't respond to ur mails",
                            "asmof I dont know who is your bf?"),
                   stringsAsFactors = FALSE)

Initially I had a data frame with around 1,000 observations and around 100 abbreviations, so I was able to run the analysis. Now the data has grown to almost 50,000 observations, and the two nested for-loops make the processing very slow. Can you suggest a better alternative to the for-loops and explain with an example how to use it in this situation? If the problem can be solved faster through vectorization, please show how to do that as well.

Thanks for the help!

Upvotes: 0

Views: 2800

Answers (2)

January

Reputation: 17090

First of all, there is no need to rebuild the regular expression strings on every iteration of the loop. There is also no need to loop over data$text at all: in R you can very often pass a vector where a single value would do, and the function goes through all the elements and returns a vector of the same length.

Abbreviation$regex <- sprintf("(\\b%s\\b)", Abbreviation$abb)

for (j in seq_along(Abbreviation$abb)) {
  data$text <- gsub(Abbreviation$regex[j],
                    Abbreviation$Fullform[j], data$text,
                    ignore.case = TRUE)
}

The above code works with the example data.
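
As a rough, self-contained sketch of the same idea, the loop over abbreviations can be wrapped in a function that takes the whole text vector at once. The helper name expand_abbreviations and the variables messages and abbreviations are hypothetical, and perl = TRUE is my own addition rather than part of the answer above.

```r
# Hypothetical helper, not part of the original answer.
expand_abbreviations <- function(text, abb, fullform) {
  # Build all word-bounded patterns once, outside the loop.
  patterns <- sprintf("\\b%s\\b", abb)
  for (j in seq_along(patterns)) {
    # Each gsub() call processes the entire character vector.
    text <- gsub(patterns[j], fullform[j], text,
                 ignore.case = TRUE, perl = TRUE)
  }
  text
}

# Sample data copied from the question.
abbreviations <- data.frame(
  abb = c("hru", "asap", "bf", "ur", "u", "afk"),
  Fullform = c("how are you", "as soon as possible", "boyfriend",
               "your", "you", "away from keyboard"),
  stringsAsFactors = FALSE
)
messages <- c("its good to see you, hru doing?",
              "I am near bridge come ASAP",
              "Can u tell me the method u used for",
              "afk so couldn't respond to ur mails",
              "asmof I dont know who is your bf?")

expanded <- expand_abbreviations(messages, abbreviations$abb,
                                 abbreviations$Fullform)
print(expanded)
```

With this shape the number of gsub() calls equals the number of abbreviations (about 100 in the question), regardless of how many rows the text vector has.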

Upvotes: 1

agstudy

Reputation: 121568

This should be faster, and it has no side effects.

mapply(function(x, y) {
  abb <- paste0("(\\b", x, "\\b)")
  gsub(abb, y, tolower(data$text))
}, Abbreviation$abb, Abbreviation$Fullform)

  1. gsub is vectorized, so you can give it a character vector in which matches are sought. Here I give it data$text.
  2. I use mapply to avoid the side effect of the for loop.
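
One thing to keep in mind is that mapply() as called above returns a matrix with one column per abbreviation, and each column has only that single abbreviation expanded. If a single vector with all replacements applied is wanted, one possible side-effect-free alternative is to chain the gsub() calls with Reduce(). This is a different technique from the mapply() call, offered only as a sketch; the helper name expand_with_reduce and the sample variables are hypothetical, and perl = TRUE is my own choice.

```r
# Hypothetical helper: folds the text vector through one gsub() per abbreviation.
expand_with_reduce <- function(text, abb, fullform) {
  patterns <- paste0("\\b", abb, "\\b")
  Reduce(
    # 'current' is the text after all earlier replacements; j indexes the next pair.
    function(current, j) gsub(patterns[j], fullform[j], current, perl = TRUE),
    seq_along(patterns),
    tolower(text)  # starting value: lower-cased input, as in the question
  )
}

# Sample data copied from the question.
abb <- c("hru", "asap", "bf", "ur", "u", "afk")
fullform <- c("how are you", "as soon as possible", "boyfriend",
              "your", "you", "away from keyboard")
messages <- c("its good to see you, hru doing?",
              "I am near bridge come ASAP",
              "Can u tell me the method u used for",
              "afk so couldn't respond to ur mails",
              "asmof I dont know who is your bf?")

result <- expand_with_reduce(messages, abb, fullform)
print(result)
```

The input vector is left untouched, and the result can be assigned back explicitly, for example with data$text <- expand_with_reduce(data$text, Abbreviation$abb, Abbreviation$Fullform).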

Upvotes: 1
