Reputation: 39
Hello, I have a document-term matrix that I transformed with the tidy() function, which works perfectly. I want to plot a word cloud based on word frequency. My transformed table looks like this:
> head(Wcloud.Data)
# A tibble: 6 x 3
  document term       count
  <chr>    <chr>      <dbl>
1 1        accept         1
2 1        access         1
3 1        accomplish     1
4 1        account        4
5 1        accur          2
6 1        achiev         1
I have 33,647,383 observations, so it is a very big data frame. If I use the max() function I get a very high number (64116), but no word in my data frame has a frequency of 64116. Also, if I plot the data frame in shiny with wordcloud(), it plots the same words several times. Sorting the count column does not work either: sort(Wcloud.Data$count, decreasing = TRUE). So something is not correct, but I don't know what it is or how to solve it. Does anybody have an idea?
This is the summary of my document-term matrix before transforming it into a data frame:
> observations.tf
<<DocumentTermMatrix (documents: 76717, terms: 4234)>>
Non-/sparse entries: 33647383/291172395
Sparsity : 90%
Maximal term length: 15
Weighting : term frequency (tf)
Update: I added a picture of my data frame.
Upvotes: 0
Views: 1070
Reputation: 2829
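The tidy table has one row per document-term pair, so each term appears once for every document that contains it. Sum the counts per term before plotting. Using dplyr, you can do the following: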
library("dplyr")        # provides %>%, group_by() and summarise()
library("wordcloud")
library("RColorBrewer")
# Small example in the same shape as the tidy table
Wcloud.Data <- data.frame(document = rep("1", 6),
                          term = c("accept", "access", "accomplish", "account", "accur", "achiev"),
                          count = c(1, 1, 1, 4, 2, 1))
# One row per term, with the counts summed over all documents
Data <- Wcloud.Data %>%
  group_by(term) %>%
  summarise(Frequency = sum(count))
set.seed(1234) # makes the word layout reproducible
wordcloud(words = Data$term, freq = Data$Frequency, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
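The example above contains only one document, so the grouping step changes nothing there. Here is a minimal sketch of the same idea in base R, assuming a tidy table in which terms repeat across documents. The data and the names tidy_counts and term_totals are made up for illustration:

```r
# Hypothetical tidy table: the same terms occur in two documents
tidy_counts <- data.frame(
  document = c("1", "1", "2", "2", "2"),
  term     = c("accept", "account", "accept", "account", "accur"),
  count    = c(1, 4, 2, 3, 2),
  stringsAsFactors = FALSE
)

# "account" is listed once per document, so it occurs in more than one row
n_account_rows <- sum(tidy_counts$term == "account")

# Sum the counts per term so that every term appears exactly once
term_totals <- aggregate(count ~ term, data = tidy_counts, FUN = sum)

# Sort the whole table by total count, largest first
term_totals <- term_totals[order(term_totals$count, decreasing = TRUE), ]
rownames(term_totals) <- NULL
```

Note that sort() on the count column alone returns a sorted copy of that vector and leaves the data frame untouched, which is why the sketch reorders the rows with order() instead.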
Alternatively, the quanteda and tibble packages can help you create the term-frequency matrix. Here is an example to work with:
library(tibble)
library(quanteda)
Data <- tibble(text = c("Chinese Beijing Chinese", # tibble() replaces the deprecated data_frame()
"Chinese Chinese Shanghai",
"this is china",
"china is here",
'hello china',
"Chinese Beijing Chinese",
"Chinese Chinese Shanghai",
"this is china",
"china is here",
'hello china',
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
"Kyoto Japan",
"Tokyo Japan Chinese",
'japan'))
DocTerm <- quanteda::dfm(quanteda::tokens(Data$text)) # recent quanteda versions need tokens() before dfm()
DocTerm
# Document-feature matrix of: 19 documents, 11 features (78.5% sparse).
# 19 x 11 sparse Matrix of class "dfm"
#         features
#   docs   chinese beijing shanghai this is china here hello kyoto japan tokyo
#   text1        2       1        0    0  0     0    0     0     0     0     0
#   text2        2       0        1    0  0     0    0     0     0     0     0
#   text3        0       0        0    1  1     1    0     0     0     0     0
#   text4        0       0        0    0  1     1    1     0     0     0     0
#   text5        0       0        0    0  0     1    0     1     0     0     0
#   text6        2       1        0    0  0     0    0     0     0     0     0
#   text7        2       0        1    0  0     0    0     0     0     0     0
#   text8        0       0        0    1  1     1    0     0     0     0     0
#   text9        0       0        0    0  1     1    1     0     0     0     0
#   text10       0       0        0    0  0     1    0     1     0     0     0
#   text11       0       0        0    0  0     0    0     0     1     1     0
#   text12       1       0        0    0  0     0    0     0     0     1     1
#   text13       0       0        0    0  0     0    0     0     1     1     0
#   text14       1       0        0    0  0     0    0     0     0     1     1
#   text15       0       0        0    0  0     0    0     0     1     1     0
#   text16       1       0        0    0  0     0    0     0     0     1     1
#   text17       0       0        0    0  0     0    0     0     1     1     0
#   text18       1       0        0    0  0     0    0     0     0     1     1
#   text19       0       0        0    0  0     0    0     0     0     1     0
# Convert to a data frame. The first column holds the document ids, so drop it with [, -1];
# indexing with 2:ncol(DocTerm) would also cut off the last feature column (tokyo).
Mat <- quanteda::convert(DocTerm, to = "data.frame")[, -1]
Result <- colSums(Mat) # Total frequency of each term: this is what you are interested in
names(Result) <- colnames(Mat)
# > Result
#  chinese  beijing shanghai     this       is    china     here    hello    kyoto    japan    tokyo
#       12        2        2        2        4        6        2        2        4        9        4
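To feed the column sums into wordcloud(), it helps to turn them into a table sorted by frequency. A minimal sketch, assuming a plain base R matrix that rebuilds only the first two rows and three columns of the matrix printed above, so it runs without quanteda. The names counts, totals and freq_table are illustrative:

```r
# Toy document-feature counts: rows text1 and text2, first three features
counts <- matrix(c(2, 1, 0,
                   2, 0, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"),
                                 c("chinese", "beijing", "shanghai")))

# Column sums give one total per term
totals <- colSums(counts)

# One row per term, most frequent first
freq_table <- data.frame(term = names(totals),
                         freq = as.numeric(totals),
                         stringsAsFactors = FALSE)
freq_table <- freq_table[order(freq_table$freq, decreasing = TRUE), ]
rownames(freq_table) <- NULL
```

With the wordcloud package loaded, the table can then be plotted with wordcloud(words = freq_table$term, freq = freq_table$freq, min.freq = 1).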
Upvotes: 1