Reputation: 1238
I am trying to apply the following approach to a large number of words and texts:
# Create some fake data
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
"continuous improvement is an unrealistic goal",
"phrase with no match")
# Apply the 'grepl' function along the list of words, and convert the result to numeric
df <- data.frame(lapply(words, function(word) {as.numeric(grepl(word, phrases))}))
# Name the columns after the words that were searched
names(df) <- words
It takes too much time to run on large word lists and large input texts.
Is there any way to change it to make the process faster?
Upvotes: 2
Views: 233
Reputation: 39858
One possibility is to use grepl() with fixed = TRUE, which matches each word as a literal string instead of treating it as a regular expression:
lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE)))
Alternatively, you can use stri_detect_fixed() from the stringi package (note that the strings come first and the pattern second):
library(stringi)
lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word)))
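One caveat with both fixed-string variants: they are only drop-in replacements when the search words contain no regular expression metacharacters. A minimal sketch of the difference, using a made-up search term and made-up texts that are not part of the question's data:

```r
# Hypothetical search term containing "." (a regex metacharacter)
pattern <- "3.5"
texts <- c("version 3.5", "code 345")

# Regex matching: "." stands for any single character, so "345" also matches
regex_hits <- grepl(pattern, texts)

# Fixed matching: "." must be a literal dot
fixed_hits <- grepl(pattern, texts, fixed = TRUE)

print(regex_hits)
print(fixed_hits)
```

For plain words such as "stock" or "revenue" the two modes give the same result, so switching to fixed matching changes only the speed.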
A small simulation:
phrases <- rep(phrases, 100000)
library(microbenchmark)
microbenchmark(
  grepl = lapply(words, function(word) as.numeric(grepl(word, phrases))),
  grepl_fixed = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
  stri_detect_fixed = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
  times = 50
)
Unit: milliseconds
              expr      min       lq      mean   median       uq       max neval
             grepl 857.5839 918.3976 1007.4775 957.3126 986.9762 1631.5336    50
       grepl_fixed 116.8073 130.1615  146.6852 139.1170 152.0428  278.1512    50
 stri_detect_fixed 105.2338 116.9041  128.8941 126.7353 135.7818  199.4968    50
As proposed by @akrun, some performance improvement could be achieved by replacing as.numeric() with a unary +:
microbenchmark(
  grepl_plus = lapply(words, function(word) +grepl(word, phrases)),
  grepl_fixed_plus = lapply(words, function(word) +grepl(word, phrases, fixed = TRUE)),
  stri_detect_fixed_plus = lapply(words, function(word) +stri_detect_fixed(phrases, word)),
  grepl_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases))),
  grepl_fixed_as_numeric = lapply(words, function(word) as.numeric(grepl(word, phrases, fixed = TRUE))),
  stri_detect_fixed_as_numeric = lapply(words, function(word) as.numeric(stri_detect_fixed(phrases, word))),
  times = 50
)
Unit: milliseconds
                         expr      min       lq      mean   median        uq       max neval
                   grepl_plus 839.2060 889.8748 1008.0753 926.4712 1022.6071 2063.8296    50
             grepl_fixed_plus 117.0043 126.4407  141.5917 136.5732  146.2262  318.7412    50
       stri_detect_fixed_plus 104.4772 110.3147  126.3931 115.9223  124.4952  423.4654    50
             grepl_as_numeric 851.4198 893.6703  957.4348 935.0965 1010.3131 1375.0810    50
       grepl_fixed_as_numeric 121.8952 128.6741  142.4962 136.3370  145.5004  235.6042    50
 stri_detect_fixed_as_numeric 106.0639 114.6759  128.0724 121.9647  135.4791  191.1315    50
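The snippets above return a list, while the question builds a data frame with one 0/1 column per word. A minimal sketch of putting the two together, assuming base R only and the small example data from the question (the name hits is introduced here for illustration); grepl(word, phrases, fixed = TRUE) could be swapped for stri_detect_fixed(phrases, word) if stringi is installed:

```r
words <- c("stock", "revenue", "continuous improvement")
phrases <- c("blah blah stock and revenue", "yada yada revenue yada",
             "continuous improvement is an unrealistic goal",
             "phrase with no match")

# One logical column per word; vapply() verifies that every result
# has exactly one entry per phrase.
hits <- vapply(words,
               function(word) grepl(word, phrases, fixed = TRUE),
               logical(length(phrases)))

# Unary plus turns the whole logical matrix into 0/1 integers at once.
df <- as.data.frame(+hits)

# Name the columns after the words that were searched
names(df) <- words

print(df)
```

Converting the matrix once, after the loop, avoids calling as.numeric() or + separately for every word, although the timings above suggest the matching itself dominates the run time.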
Upvotes: 4