Reputation: 1624
I have a unique set of words in a character vector (that have been 'stemmed') and I want to know how many of them appear in a string.
Here's what I have so far:
library(RTextTools)
string <- "Players Information donation link controller support years fame glory addition champion Steer leader gang ghosts life Power Pellets tables gobble ghost"
wordstofind <- c("player", "fame", "field", "donat")
# I created a stemmed list of the string (the column names of the
# document-term matrix are the unique stemmed tokens)
string.stem <- colnames(create_matrix(string, stemWords = TRUE, removeStopwords = FALSE))
I know the next step probably involves grepl("\\bword\\b", value) or some other use of regex, but I'm not sure what the fastest option is in this case.
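Something like this rough sketch is what I'm imagining (just a sketch, using base R's sapply/grepl against string.stem from above):
# Sketch of the grepl idea: test each stem as a whole word against
# the stemmed tokens and count how many of them match anything
sum(sapply(wordstofind, function(w)
  any(grepl(paste0("\\b", w, "\\b"), string.stem))
))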
Any push in the right direction would be great.
Upvotes: 0
Views: 157
Reputation: 3488
Well, I never work with huge datasets, so time is never of the essence for me, but given the data you've provided, this will give you a count of how many words exactly match something in the string. Might be a good starting point.
> sum(wordstofind %in% unlist(strsplit(string, " ")))
[1] 1
Edit: Using the stems gets the proper 3 matches ("fame" is the only word that matches verbatim; "player" and "donat" only match the stemmed tokens), thanks to @Anthony Bissel:
> sum(wordstofind %in% unlist(string.stem))
[1] 3
Upvotes: 2
Reputation: 1624
There certainly might be a faster option, but this works:
length(wordstofind) - length(setdiff(wordstofind, string.stem)) # 3
But it looks like Andrew Taylor's answer is faster:
library(microbenchmark)
microbenchmark(sum(wordstofind %in% unlist(string.stem)),
               length(wordstofind) - length(setdiff(wordstofind, string.stem)))
Unit: microseconds
                                                            expr    min     lq     mean median     uq    max neval
                      sum(wordstofind %in% unlist(string.stem))  4.016  4.909  6.55562  5.355  5.801 37.485   100
length(wordstofind) - length(setdiff(wordstofind, string.stem)) 16.511 16.958 21.85303 17.404 18.296 81.218   100
Upvotes: 0