Abhishek Sourabh

Reputation: 101

Classifying new text using LDA in R

I am trying out topic modeling in R for the first time, so this may be a basic question, but I am stuck and searching has not turned up a definitive answer.

Given a corpus of documents, I used the LDA function to identify the different topics in the corpus. Once the model has been fitted, how can I apply it to a new batch of documents to classify them among the topics discovered so far?

Example code:

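library(tm)           # subsetting methods for DocumentTermMatrix
library(topicmodels)  # provides LDA()
data("AssociatedPress", package = "topicmodels")

n <- nrow(AssociatedPress)
# row indices of a random 75% training sample
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[train_data, ]
# negative indexing keeps the rows that are not in the training sample
AssociatedPress_test <- AssociatedPress[-train_data, ]

ap_lda <- LDA(AssociatedPress_train, k = 5, control = list(seed = 1234))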

Now, can I classify the documents in AssociatedPress_test using the fitted model ap_lda? If yes, how? If not, what would be the best way to create a model for such future classification?

Upvotes: 5

Views: 1370

Answers (1)

Adam Spannbauer

Reputation: 2757

You can use the topicmodels::posterior() function to find the "top topic" for each new document in your AssociatedPress_test object. Below is a snippet showing how to accomplish this.

# code provided in question------------------------------------------
library(tm)
data("AssociatedPress", package = "topicmodels")

n <- nrow(AssociatedPress)
train_data <- sample(1:n, 0.75*n, replace = FALSE)
AssociatedPress_train <- AssociatedPress[ train_data, ]
AssociatedPress_test  <- AssociatedPress[-train_data, ]

ap_lda <- topicmodels::LDA(AssociatedPress_train, k = 5, 
                           control = list(seed = 1234))
#--------------------------------------------------------------------

# posterior probabilities of topics for each document & terms
post_probs <- topicmodels::posterior(ap_lda, AssociatedPress_test)

# classify documents by finding the topic with max prob per doc
top_topic_per_doc <- apply(post_probs$topics, 1, which.max)

head(top_topic_per_doc)

#OUTPUT (exact values will vary, since the sample() call above is not seeded)
# [1] 4 2 4 2 2 2
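
To make the classification step easier to follow, here is a minimal base-R sketch of what the which.max call does. The matrix below is made up for illustration (3 documents, 3 topics); in the snippet above, the real values would come from post_probs$topics. The names topic_probs, top_topic, top_prob and result are hypothetical.

```r
# Toy posterior matrix: one row per document, one column per topic.
# These numbers are invented for illustration only.
topic_probs <- matrix(
  c(0.10, 0.70, 0.20,
    0.50, 0.25, 0.25,
    0.05, 0.15, 0.80),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("1", "2", "3"))
)

# Column index of the most probable topic for each document (row)
top_topic <- apply(topic_probs, 1, which.max)

# Probability of that topic, a rough indication of how clear-cut the assignment is
top_prob <- apply(topic_probs, 1, max)

result <- data.frame(top_topic = top_topic, top_prob = top_prob)
print(result)
```

Keeping the winning probability next to the topic index can help spot documents such as doc2 here, where the top topic only narrowly beats the others and a hard assignment may be less meaningful.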

Upvotes: 5
