Reputation: 1624
I have a unique set of words in a character vector (that have been 'stemmed') and I want to know how many of them appear in a string.
Here's what I have so far:
library(RTextTools)
string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player", "fame", "field", "donat")
# I created a stemmed list of the string (the column names of the
# document-term matrix are the unique stemmed tokens)
string.stem <- colnames(create_matrix(string, stemWords = TRUE, removeStopwords = FALSE))
I know the next step probably involves grepl("\\bword\\b", value) or some other use of regex, but I'm not sure what the fastest option is in this case.
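Something like this rough sketch is what I'm imagining (just a sketch, using base R's sapply/grepl against string.stem from above):
# Sketch of the grepl idea: test each stem as a whole word against
# the stemmed tokens and count how many of them match anything
sum(sapply(wordstofind, function(w)
  any(grepl(paste0("\\b", w, "\\b"), string.stem))
))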
Any push in the right direction would be great.
Upvotes: 0
Views: 157
Reputation: 3488
Well, I never work with huge datasets, so time is never of the essence for me, but given the data you've provided, this will give you a count of how many words exactly match something in the string. Might be a good starting point.
> sum(wordstofind %in% unlist(strsplit(string, " ")))
[1] 1
Edit: Using the stems gets the proper 3 matches ("fame" is the only word that matches verbatim; "player" and "donat" only match the stemmed tokens), thanks to @Anthony Bissel:
> sum(wordstofind %in% unlist(string.stem))
[1] 3
Upvotes: 2
Reputation: 1624
There certainly might be a faster option, but this works:
length(wordstofind) - length(setdiff(wordstofind, string.stem)) # 3
But it looks like Andrew Taylor's answer is faster:
library(microbenchmark)
microbenchmark(sum(wordstofind %in% unlist(string.stem)),
               length(wordstofind) - length(setdiff(wordstofind, string.stem)))
Unit: microseconds
                                                            expr    min     lq     mean median     uq    max neval
                      sum(wordstofind %in% unlist(string.stem))  4.016  4.909  6.55562  5.355  5.801 37.485   100
length(wordstofind) - length(setdiff(wordstofind, string.stem)) 16.511 16.958 21.85303 17.404 18.296 81.218   100
Upvotes: 0