Reputation: 1767
My task is to apply LDA to a dataset of Amazon reviews and extract 50 topics.
I have extracted the review text into a vector and am now trying to apply LDA.
I have created the DTM:
library(RTextTools)
matrix <- create_matrix(dat, language="english", removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE)
<<DocumentTermMatrix (documents: 100000, terms: 174632)>>
Non-/sparse entries: 4096244/17459103756
Sparsity : 100%
Maximal term length: 218
Weighting : term frequency (tf)
But when I try to run LDA (from the topicmodels package), I get the following error:
lda <- LDA(matrix, 30)
Error in LDA(matrix, 30) :
Each row of the input matrix needs to contain at least one non-zero entry
I searched for solutions and used the slam package:
matrix1 <- rollup(matrix, 2, na.rm=TRUE, FUN = sum)
but I am still getting the same error.
I am very new to this. Can someone help me or suggest a reference to study? It would be very helpful.
There are no empty rows in my original data, and it contains only one column, which holds the reviews.
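
Edit: this error usually seems to mean that some documents end up with zero terms after preprocessing (e.g. reviews made up entirely of stop words), which rollup() does not fix since it only aggregates over an index. A minimal sketch of one way to drop such rows before calling LDA, assuming the DTM is stored in matrix:

library(slam)
# per-document term counts, computed on the sparse representation
row_totals <- row_sums(matrix)
# keep only documents with at least one non-zero entry
matrix <- matrix[row_totals > 0, ]
lda <- LDA(matrix, 30)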
Upvotes: 1
Views: 3630
Reputation: 1621
I have been assigned a similar task. I am also learning as I go and have made some progress, so I am sharing my code snippet; I hope it helps.
library("topicmodels")
library("tm")
func <- function(input){  # 'input' is unused here; the demo runs on a hard-coded sample
x<-c("I like to eat broccoli and bananas.",
"I ate a banana and spinach smoothie for breakfast.",
"Chinchillas and kittens are cute.",
"My sister adopted a kitten yesterday.",
"Look at this cute hamster munching on a piece of broccoli.")
# Earlier manual approach, kept for reference:
#text<-tolower(x)                            # lowercase the whole text
#text2<-setdiff(text,stopwords("english"))   # drop common stop words
#text3<-strsplit(text2," ")                  # split the text into word vectors
# Generate a structured text object, i.e. a Corpus
docs<-Corpus(VectorSource(x))
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# Replace special characters with spaces
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove common English stop words
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove punctuation
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white space
docs <- tm_map(docs, stripWhitespace)
# Drop stray tabs and empty tokens
docs <- tm_map(docs, removeWords, c("\t", " ", ""))
# LDA() expects documents as rows, so build a document-term matrix
# (a TermDocumentMatrix would treat terms as documents)
dtm <- DocumentTermMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))
#print(dtm)
# Term frequencies across the corpus, sorted and written out
freq <- colSums(as.matrix(dtm))
print(names(freq))
ord <- order(freq, decreasing = TRUE)
write.csv(freq[ord], "word_freq.csv")
# Gibbs sampling parameters
burnin <- 4000   # iterations discarded as burn-in
iter <- 2000     # iterations kept after burn-in
thin <- 500      # keep every 500th draw
seed <- list(2003, 5, 63, 100001, 765)  # one seed per restart
nstart <- 5      # number of independent restarts
best <- TRUE     # return only the best-fitting run
# Number of topics
k <- 3
# Fit the model and map docs to topics
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))
ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))
ldaOut  # return the fitted model so it can be inspected outside the function
}
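
Beyond the document-to-topic assignments, the fitted model can also give the top terms per topic and the per-document topic probabilities. A minimal sketch of both, using the model returned by func() above (the output file names are just examples):

ldaOut <- func(NULL)  # 'input' is unused in this demo
# top 10 terms for each topic
ldaOut.terms <- as.matrix(terms(ldaOut, 10))
write.csv(ldaOut.terms, file = "LDAGibbs_TopicsToTerms.csv")
# per-document topic probabilities (the gamma matrix from the Gibbs fit)
topicProbabilities <- as.data.frame(ldaOut@gamma)
write.csv(topicProbabilities, file = "LDAGibbs_TopicProbabilities.csv")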
Upvotes: 1