Abhishek Gangwar

Reputation: 1767

How to do LDA in R

My task is to apply LDA to a dataset of Amazon reviews and extract 50 topics.

I have extracted the review text into a vector and now I am trying to apply LDA.

I have created the document-term matrix (DTM):

# create_matrix() is from the RTextTools package
matrix <- create_matrix(dat, language = "english", removeStopwords = TRUE, stemWords = FALSE, stripWhitespace = TRUE, toLower = TRUE)

<<DocumentTermMatrix (documents: 100000, terms: 174632)>>
Non-/sparse entries: 4096244/17459103756
Sparsity           : 100%
Maximal term length: 218
Weighting          : term frequency (tf)
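
Note: with 174,632 terms this matrix is extremely sparse. A common preprocessing step before LDA is to drop very rare terms with tm's removeSparseTerms(); a minimal sketch, assuming matrix is the DTM above and the 0.999 threshold is just an illustration:

    library(tm)
    # drop terms that are absent from more than 99.9% of documents
    matrix <- removeSparseTerms(matrix, 0.999)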

but when I try to fit the model I get the following error:

lda <- LDA(matrix, 30)

Error in LDA(matrix, 30) : 
  Each row of the input matrix needs to contain at least one non-zero entry

I searched for some solutions and used the slam package:

    matrix1 <- rollup(matrix, 2, na.rm=TRUE, FUN = sum)

but I am still getting the same error.

I am very new to this. Can someone help me, or suggest some references to study? It would be very helpful.

There are no empty rows in my original data, and it contains only one column, which holds the reviews.
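
Note: this error means that some rows of the DTM sum to zero, i.e. reviews that became empty after stopword removal; rolling up over dimension 2 aggregates term columns and does not remove such rows. A minimal sketch of the usual fix, assuming matrix is the DocumentTermMatrix above (row_sums() is from the slam package):

    library(slam)

    # keep only documents that still contain at least one term
    matrix <- matrix[row_sums(matrix) > 0, ]
    lda <- LDA(matrix, 50)   # 50 topics, as in the task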

Upvotes: 1

Views: 3630

Answers (1)

Partha Roy

Reputation: 1621

I have been assigned a similar kind of task. I am also still learning as I go, but I have made some progress, so I am sharing my code snippet. I hope it helps.

library("topicmodels")
library("tm")

# Example documents
x <- c("I like to eat broccoli and bananas.",
       "I ate a banana and spinach smoothie for breakfast.",
       "Chinchillas and kittens are cute.",
       "My sister adopted a kitten yesterday.",
       "Look at this cute hamster munching on a piece of broccoli.")



#whole file is lowercased
#text<-tolower(x)

#deleting all common words from the text
#text2<-setdiff(text,stopwords("english"))

#splitting the text into vectors where each vector is a word..
#text3<-strsplit(text2," ")

# Generating a structured text i.e. Corpus
docs<-Corpus(VectorSource(x))

# Creating a content transformer, i.e. a function used to modify corpus objects in R

toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))

# Replace special characters with spaces

docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
docs <- tm_map(docs, removeNumbers)

# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))

# Remove punctuations
docs <- tm_map(docs, removePunctuation)

# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)

# Remove stray tabs and empty tokens
docs <- tm_map(docs, removeWords, c("\t", " ", ""))

# Build a DocumentTermMatrix (documents as rows) -- the orientation LDA() expects
dtm <- DocumentTermMatrix(docs, control = list(removePunctuation = TRUE, stopwords = TRUE))

#print(dtm)

# Term frequencies across the corpus
freq <- colSums(as.matrix(dtm))

print(names(freq))


ord <- order(freq, decreasing = TRUE)

write.csv(freq[ord], "word_freq.csv")

# Setting parameters for the Gibbs sampler

burnin <- 4000
iter <- 2000
thin <- 500
seed <- list(2003, 5, 63, 100001, 765)
nstart <- 5
best <- TRUE

# Number of topics
k <- 3

# Docs to topics
ldaOut <- LDA(dtm, k, method = "Gibbs",
              control = list(nstart = nstart, seed = seed, best = best,
                             burnin = burnin, iter = iter, thin = thin))

ldaOut.topics <- as.matrix(topics(ldaOut))
write.csv(ldaOut.topics, file = paste("LDAGibbs", k, "DocsToTopics.csv"))
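
To also look at the topics themselves, a possible extension (terms() is from topicmodels; 6 is just an arbitrary number of top terms per topic):

    # Top 6 terms in each topic
    ldaOut.terms <- as.matrix(terms(ldaOut, 6))
    write.csv(ldaOut.terms, file = paste("LDAGibbs", k, "TopicsToTerms.csv"))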

Upvotes: 1
