Maximum occurrence of any set of words in text in R

Question

Given some lines of text, I need to find the most frequently occurring words or phrases (not necessarily single words; sequences of several words count too).

Say I have a text like this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"

I want the output to be:

john beck - 3
chemical engineer - 2

Is there a function or package that does this?

lukeA · Accepted Answer

Try this:

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
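library(tau)  # textcnt() for n-gram counts
library(tm)   # MC_tokenizer(), stemDocument(), stemCompletion()
## split the text into tokens and drop the empty ones
tokens <- MC_tokenizer(string)
tokens <- tokens[tokens != ""]
## stem each token, then complete the stems back to words found in the text,
## so that "engineers" is counted together with "engineer"
string_ <- paste(stemCompletion(stemDocument(tokens), tokens), collapse = " ")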

## if you want only bi-grams: 
tab <- sort(textcnt(string_, method = "string", n = 2), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# john beck            3
# chemical engineer    2

## if you want uni-, bi- and tri-grams: 
nmin <- 1; nmax <- 3
tab <- sort(do.call(c, lapply(nmin:nmax, function(x) textcnt(string_, method = "string", n = x) )), decreasing = TRUE)
data.frame(Freq = tab[tab > 1])
#                   Freq
# beck                 3
# john                 3
# john beck            3
# chemical             2
# engineer             2
# is                   2
# chemical engineer    2
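
For comparison, here is a minimal base-R sketch of the counting step alone, without tau or tm. The helper name `count_ngrams` is hypothetical (it is not part of either package), and it assumes plain ASCII text that can be lower-cased and split on any non-letter character. It does no stemming, which shows why the stemming step above matters.

```r
# Hypothetical helper (not from tau or tm): count word n-grams with base R only.
count_ngrams <- function(text, n = 2) {
  # Assumption: ASCII text; lower-case it and split on runs of non-letters.
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(table(character(0)))
  # Paste together every run of n consecutive words.
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  # Tabulate and put the most frequent n-grams first.
  sort(table(grams), decreasing = TRUE)
}

string <- "He is john beck. john beck is working as an chemical engineer. Most of the chemical engineers are john beck's friend"
counts <- count_ngrams(string, n = 2)
counts[counts > 1]  # only "john beck" (3) survives the filter
```

Because nothing is stemmed here, "chemical engineer" and "chemical engineers" are counted as two different bi-grams with one occurrence each, so only "john beck" exceeds a count of 1. The sketch also ignores sentence boundaries, so pairs such as "beck john" that span a full stop are counted as well.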
