Reputation: 175
I have a dataframe with a column of text strings that I would like to analyse. I want to find the most frequently used words and visualise them in a wordcloud. For single words (unigrams) I've managed to do so, but my code fails for n-grams (e.g. bigrams, trigrams). Below is my code for the unigrams. I'm open to adjusting my code to make it work, or to using a completely new piece of code. How would I best approach this?
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)
library(tm)
library(stringr)
#Remove special characters and convert to lowercase
df$text <- str_replace_all(df$text, "[^[:alnum:]]", " ")
df$text <- tolower(df$text)
#From df to Corpus (use the text column so each row becomes a document)
corpus <- Corpus(VectorSource(df$text))
#Remove English stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
#Make term-document matrix
tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
#Make list of most frequent words
tdm_freq <- as.matrix(tdm)
words <- sort(rowSums(tdm_freq),decreasing=TRUE)
tdm_freq <- data.frame(word = names(words),freq=words)
rm(words)
#Make a wordcloud
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize = 0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8,
           widgetsize = NULL, figPath = NULL, hoverFunction = NULL)
Upvotes: 0
Views: 879
Reputation: 456
Change Corpus to VCorpus so tokenising will work: Corpus returns a SimpleCorpus for a vector source, and TermDocumentMatrix ignores a custom tokenize function when given a SimpleCorpus.
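If you want to verify which class you're getting, a quick check (not essential to the solution):

library(tm)
# Corpus() silently returns a SimpleCorpus for a vector source
class(Corpus(VectorSource("some text")))    # "SimpleCorpus" "Corpus"
# VCorpus() returns a VCorpus, which does respect a custom tokenizer
class(VCorpus(VectorSource("some text")))   # "VCorpus" "Corpus"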
# Data
df <- data.frame(text = c("I have dataframe with a column I have dataframe with a column",
                          "I would like to know what are the most I would like to know what are the most",
                          "For single words (unigrams) I've managed to do so For single words (unigrams) I've managed to do so",
                          "Here I've included my code for the unigrams Here I've included my code for the unigrams"))
# VCorpus (build from the text column so each row becomes a document)
corpus <- VCorpus(VectorSource(df$text))
# tm_reduce applies the functions from last to first, so tolower runs first
funs <- list(stripWhitespace,
             removePunctuation,
             function(x) removeWords(x, stopwords("english")),
             content_transformer(tolower))
corpus <- tm_map(corpus, FUN = tm_reduce, tmFuns = funs)
# Tokenise into bigrams without requiring any extra package
# (words() and ngrams() come from NLP, which tm loads)
ngram_token <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
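# e.g. ngram_token(corpus[[1]]) returns "dataframe column" "column dataframe" "dataframe column"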
# Pass into TDM control argument
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = ngram_token))
freq <- rowSums(as.matrix(tdm))
tdm_freq <- data.frame(term = names(freq), occurrences = freq)
tdm_freq
term occurrences
code unigrams code unigrams 2
column dataframe column dataframe 1
column like column like 1
dataframe column dataframe column 2
included code included code 2
...
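From there you can reuse the wordcloud2 call from your question unchanged: wordcloud2 takes the first two columns of the data frame as the term and its frequency, so the term/occurrences names don't need renaming. A sketch using your original settings:

# Bigram wordcloud, reusing the settings from the question
wordcloud2(tdm_freq, size = 0.4, minSize = 10, gridSize = 0,
           fontFamily = 'Segoe UI', fontWeight = 'normal',
           color = 'red', backgroundColor = "white",
           minRotation = -pi/4, maxRotation = pi/4, shuffle = TRUE,
           rotateRatio = 0.4, shape = 'circle', ellipticity = 0.8)
# For trigrams, the same pipeline with n = 3:
# ngram_token <- function(x) unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)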
Upvotes: 1