Reputation: 23
Here is the code I use to create bi-grams with a frequency list:
#data <- myData[,2]
library(tm)    # Corpus, tm_map, TermDocumentMatrix
library(RWeka) # NGramTokenizer, Weka_control
tdm.generate <- function(string, ng){
# tutorial on rweka -
corpus <- Corpus(VectorSource(string)) # create corpus for TM processing
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# corpus <- tm_map(corpus, removeWords, stopwords("english"))
options(mc.cores=1) # use a single core
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng)) # create n-grams
tdm <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer)) # create tdm from n-grams
tdm # return the term-document matrix
}
source("GenerateTDM.R") # generatetdm function in appendix
tdm <- tdm.generate("The book The book The greatest The book",2)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
topwords <- as.numeric(topwords)
hist(topwords, breaks = 10)
tdm.matrix <- as.matrix(tdm)
topwords <- rowSums(tdm.matrix)
head(sort(topwords, decreasing = TRUE))
The result for the above code is:
the book greatest
4 3 1
Instead, I'm looking for the result where bi-grams are shown like:
"the book" "book the"
3 2
What needs to be changed in the above code to get the output as above?
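For reference, the expected bi-gram counts can be cross-checked without tm or RWeka. This is only a minimal base-R sketch for the sample string above; the variable names (`text`, `words`, `bigrams`, `counts`) are made up for illustration and are not part of the original code.

```r
# Sample string from the question
text <- "The book The book The greatest The book"

# Lowercase and split on whitespace to get the individual words
words <- strsplit(tolower(text), "\\s+")[[1]]

# Pair each word with the word that follows it to form bi-grams
bigrams <- paste(head(words, -1), tail(words, -1))

# Count each bi-gram and sort by frequency
counts <- sort(table(bigrams), decreasing = TRUE)
print(counts)
```

This gives 3 for "the book" and 2 for "book the", which matches the counts I am after.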
Upvotes: 2
Views: 514
Reputation: 11
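You need to use VCorpus instead of Corpus. As far as I can tell, in recent versions of tm, Corpus(VectorSource(...)) returns a SimpleCorpus, and TermDocumentMatrix does not apply a custom tokenize function to that class, which is why you only get single words. I was having the same issue; you can check more details here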
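A sketch of how the function from the question might look with that one change applied. It assumes tm and RWeka (which needs a working Java setup) are installed; everything except the switch to `VCorpus` is carried over from the question, and I have not tested it against every tm version.

```r
library(tm)
library(RWeka)

tdm.generate <- function(string, ng) {
  # VCorpus instead of Corpus, so the custom tokenizer below is actually used
  corpus <- VCorpus(VectorSource(string))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  # Tokenizer that emits n-grams of exactly `ng` words
  BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = ng, max = ng))
  # Build and return the term-document matrix from the n-grams
  TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
}

tdm <- tdm.generate("The book The book The greatest The book", 2)
topwords <- rowSums(as.matrix(tdm))
head(sort(topwords, decreasing = TRUE))
```

With this, the row names of the matrix should be bi-grams such as "the book" and "book the" rather than single words.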
Upvotes: 1