Reputation: 175
I have a dataframe with a column of text strings that I would like to analyse. I want to find the most frequently used words and visualise them in a wordcloud. For single words (unigrams) I've managed to do so, but my code fails for n-grams (e.g. bigrams, trigrams). Below is my code for the unigrams. I'm open to adjusting my code to make it work, or to using a completely new piece of code. How would I best approach this?
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(stringr)
#Remove special characters and convert to lowercase
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
df$text <- tolower(df$text)
#From df to Corpus (use the text column so each row becomes a document)
corpus <- Corpus(VectorSource(df$text))
#Remove English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Make term-document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
#Make list of most frequent words
tdm_freq <- as.matrix(tdm)
words <- sort(rowSums(tdm_freq),decreasing=TRUE)
tdm_freq <- data.frame(word = names(words),freq=words)
rm(words)
#Make a wordcloud
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize = 0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8,
           widgetsize = NULL, figPath = NULL, hoverFunction = NULL)
Upvotes: 0
Views: 879
Reputation: 456
Change Corpus to VCorpus so tokenising will work: Corpus returns a SimpleCorpus for a vector source, and TermDocumentMatrix ignores a custom tokenize function when given a SimpleCorpus.
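If you want to verify which class you're getting, a quick check (not essential to the solution):

library(tm)
# Corpus() silently returns a SimpleCorpus for a vector source
class(Corpus(VectorSource("some text")))    # "SimpleCorpus" "Corpus"
# VCorpus() returns a VCorpus, which does respect a custom tokenizer
class(VCorpus(VectorSource("some text")))   # "VCorpus" "Corpus"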
# Data
df <- data.frame(text = c("I have dataframe with a column I have dataframe with a column",
                          "I would like to know what are the most I would like to know what are the most",
                          "For single words (unigrams) I've managed to do so For single words (unigrams) I've managed to do so",
                          "Here I've included my code for the unigrams Here I've included my code for the unigrams"))
# VCorpus (build from the text column so each row becomes a document)
corpus <- VCorpus(VectorSource(df$text))
# tm_reduce applies the functions from last to first, so tolower runs first
funs <- list(stripWhitespace,
             removePunctuation,
             function(x) removeWords(x, stopwords("english")),
             content_transformer(tolower))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = funs)
# Tokenise into bigrams without requiring any extra package
# (words() and ngrams() come from NLP, which tm loads)
ngram_token <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
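# e.g. ngram_token(corpus[[1]]) returns "dataframe column" "column dataframe" "dataframe column"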
# Pass into TDM control argument
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_token))
freq <- rowSums(as.matrix(tdm))
tdm_freq <- data.frame(term = names(freq), occurrences = freq)
tdm_freq
term occurrences
code unigrams code unigrams 2
column dataframe column dataframe 1
column like column like 1
dataframe column dataframe column 2
included code included code 2
...
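From there you can reuse the wordcloud2 call from your question unchanged: wordcloud2 takes the first two columns of the data frame as the term and its frequency, so the term/occurrences names don't need renaming. A sketch using your original settings:

# Bigram wordcloud, reusing the settings from the question
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize = 0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8)
# For trigrams, the same pipeline with n = 3:
# ngram_token <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)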
Upvotes: 1