Nathalie

Reputation: 1238

How to speed up the grepl function?

I am trying to apply grepl to a large number of words and texts:

# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada", 
             "continuous improvement is an unrealistic goal", 
             "phrase with no match")

# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns the words that were searched
names(df) <- words

It takes too much time when applied to large word lists and input text.

Is there any way to change it to make the process faster?

Upvotes: 2

Views: 233

Answers (1)

tmfmnk

Reputation: 39858

One possibility is to use grepl() with fixed = TRUE:

lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE)))
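One thing to keep in mind: with fixed = TRUE the pattern is matched as a literal string rather than as a regular expression, so the speedup only applies when the search terms are plain text, as the words in the question are. A minimal sketch of the difference, using made-up inputs (texts, regex_hits and literal_hits are illustrative names, not part of the question):

```r
# Hypothetical inputs chosen only to show the difference
texts <- c("abc", "a.c")

# As a regular expression, "." matches any single character
regex_hits <- grepl("a.c", texts)

# With fixed = TRUE, "." only matches a literal dot
literal_hits <- grepl("a.c", texts, fixed = TRUE)

print(regex_hits)
print(literal_hits)
```

If the words contain regex metacharacters that are meant to act as patterns, or if case-insensitive matching is needed, fixed = TRUE is not a drop-in replacement.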

Alternatively, you can use stri_detect_fixed() from stringi:

library(stringi)
# Note the argument order: the text comes first, the pattern second
lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word)))

A small simulation:

phrases <- rep(phrases, 100000)

library(microbenchmark)
library(stringi)
microbenchmark(grepl = lapply(words, function(word) as.numeric(grepl(word, phrases))),
               grepl_fixed = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
               stri_detect_fixed = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
               times = 50)

Unit: milliseconds
              expr      min       lq      mean   median       uq       max neval
             grepl 857.5839 918.3976 1007.4775 957.3126 986.9762 1631.5336    50
       grepl_fixed 116.8073 130.1615  146.6852 139.1170 152.0428  278.1512    50
 stri_detect_fixed 105.2338 116.9041  128.8941 126.7353 135.7818  199.4968    50

As proposed by @akrun, as.numeric() can be replaced with a unary +, although the timings below show only a marginal difference:

microbenchmark(grepl_plus = lapply(words, function(word) +grepl(word, phrases)),
               grepl_fixed_plus = lapply(words, function(word) +grepl(word, phrases, fixed = TRUE)),
               stri_detect_fixed_plus = lapply(words, function(word) +stri_detect_fixed(phrases, word)),
               grepl_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases))),
               grepl_fixed_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
               stri_detect_fixed_as_numeric = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
               times = 50)

Unit: milliseconds
                         expr      min       lq      mean   median        uq       max neval
                   grepl_plus 839.2060 889.8748 1008.0753 926.4712 1022.6071 2063.8296    50
             grepl_fixed_plus 117.0043 126.4407  141.5917 136.5732  146.2262  318.7412    50
       stri_detect_fixed_plus 104.4772 110.3147  126.3931 115.9223  124.4952  423.4654    50
             grepl_as_numeric 851.4198 893.6703  957.4348 935.0965 1010.3131 1375.0810    50
       grepl_fixed_as_numeric 121.8952 128.6741  142.4962 136.3370  145.5004  235.6042    50
 stri_detect_fixed_as_numeric 106.0639 114.6759  128.0724 121.9647  135.4791  191.1315    50
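To get back to the data frame from the question, the faster call can be dropped into the original code as is. A minimal sketch using base R only, with the question's sample data (hits is an illustrative name; setNames() is used here just to assign the column names in one step):

```r
# Sample data from the question
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# Literal matching per word; unary + turns TRUE/FALSE into integer 1/0
hits <- lapply(words, function(word) +grepl(word, phrases, fixed = TRUE))

# One column per searched word, one row per phrase
df <- setNames(data.frame(hits), words)
print(df)
```

Note that + returns integer columns, whereas as.numeric() returns doubles; for 0/1 indicator columns this usually makes no practical difference.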

Upvotes: 4
