Implementing n-grams for next word prediction

Question

I'm trying to utilize a trigram for next word prediction.

I have been able to upload a corpus and identify the most common trigrams by their frequencies. I used the "ngrams", "RWeka" and "tm" packages in R. I followed this question for guidance:

What algorithm I need to find n-grams?

library(tm)
library(RWeka)

text1 <- readLines("MyText.txt", encoding = "UTF-8")
corpus <- Corpus(VectorSource(text1))

# min = 3, max = 3 yields trigrams only
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

If a user were to input a set of words, how would I go about generating the next word? For example, if a user types "can of", how would I retrieve the three most likely next words (e.g. beer, soda, paint)?

lukeA · Accepted Answer

Here's one way as a starter:

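f <- function(queryHistoryTab, query, n = 2) {
  require(tau)
  # Count n-grams one word longer than the query (trigrams for a two-word query)
  nWords <- length(scan(text = query, what = "character", quiet = TRUE))
  trigrams <- sort(textcnt(rep(tolower(names(queryHistoryTab)), queryHistoryTab),
                           method = "string", n = nWords + 1))
  # Keep n-grams that start with the query followed by a space, so that
  # "can of" does not also match "can off ..."
  query <- tolower(query)
  idx <- which(substr(names(trigrams), 1, nchar(query) + 1) == paste0(query, " "))
  # Take the n most frequent matches and strip the query prefix
  res <- head(names(sort(trigrams[idx], decreasing = TRUE)), n)
  res <- substr(res, nchar(query) + 2, nchar(res))
  return(res)
}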
f(c("Can of beer" = 3, "can of Soda" = 2, "A can of water" = 1, "Buy me a can of soda, please" = 2), "Can of")
# [1] "soda" "beer"
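
If you would rather not depend on tau, the same idea can be sketched in base R: count every trigram once, then look up the ones that begin with the last two words the user typed. This is only a minimal sketch, not the function above. The names build_trigram_counts and predict_next are made up for illustration, and the tokenizer is an assumption: it lowercases the text and splits on anything other than a-z and apostrophes, so it only suits plain English input.

```r
# Count all trigrams in a character vector of sentences (base R only).
# Hypothetical helper; the tokenization rule is an assumption.
build_trigram_counts <- function(sentences) {
  tokens <- strsplit(tolower(sentences), "[^a-z']+")
  trigrams <- as.character(unlist(lapply(tokens, function(words) {
    words <- words[nzchar(words)]            # drop empty tokens
    if (length(words) < 3) return(character(0))
    idx <- seq_len(length(words) - 2)
    paste(words[idx], words[idx + 1], words[idx + 2])
  })))
  tab <- table(trigrams)
  # Return a plain named integer vector: trigram -> count
  structure(as.vector(tab), names = names(tab))
}

# Return up to k candidate next words for the last two words of `query`.
predict_next <- function(counts, query, k = 3) {
  words <- strsplit(tolower(trimws(query)), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]
  if (length(counts) == 0 || length(words) < 2) return(character(0))
  # The trailing space keeps "can of" from matching "can off ..."
  prefix <- paste0(paste(tail(words, 2), collapse = " "), " ")
  hits <- counts[startsWith(names(counts), prefix)]
  if (length(hits) == 0) return(character(0))
  hits <- sort(hits, decreasing = TRUE)
  # Strip the two-word prefix, leaving only the predicted word
  substring(names(head(hits, k)), nchar(prefix) + 1)
}

sentences <- c(rep("Can of beer", 3), rep("can of Soda", 2),
               "A can of water", rep("Buy me a can of soda, please", 2))
counts <- build_trigram_counts(sentences)
predict_next(counts, "Can of")
# [1] "soda"  "beer"  "water"
```

Building the counts once and reusing them keeps each lookup cheap, which matters if you query on every keystroke. For unseen two-word contexts this returns an empty vector; falling back to bigram counts would be a natural next step.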
