Reputation: 11
I have been trying to do topic modeling on a collection of discussion forum posts in a MOOC. I tried basic LDA, but the resulting topics were meaningless, so now I'm looking into seeding my topics to get better ones. I found the seededlda package, which requires a dfm as input along with a dictionary of seeded terms. It works well! My issue is figuring out how each document, or forum post, is categorized.
My original data has "userid" as a variable and "post" as the document I'm using for LDA. So far my code looks like this.
text <- introduction_posts$post
dfmt <- dfm(text, remove_numbers = TRUE) %>%
  dfm_remove(stopwords('en'), min_nchar = 2)
#install.packages("seededlda")
library(seededlda)
slda <- textmodel_seededlda(dfmt,
                            seeded_dict,
                            valuetype = c("glob", "regex", "fixed"),
                            case_insensitive = FALSE,
                            residual = TRUE,
                            weight = 0.01,
                            max_iter = 2000,
                            alpha = NULL,
                            beta = NULL,
                            verbose = quanteda_options("verbose")
)
terms <- terms(slda)
How can I determine which topics go to which user?
When I used the LDA function from the topicmodels package, I used a document-term matrix defined this way:
posts_dtm <- CreateDtm(doc_vec = introduction_posts$post, # character vector of documents
                       doc_names = introduction_posts$userid_bycourse, # document names
                       ngram_window = c(1, 2), # minimum and maximum n-gram length
                       stopword_vec = c(stopwords::stopwords("en"), # default English stopwords
                                        stopwords::stopwords(source = "smart"))) # SMART stopwords
which named the documents as it went along. In the end I was able to nicely see which topics went to which participants. But I can't seem to do that with the dfm that the seededlda package uses.
Any help would be appreciated.
Upvotes: 0
Views: 446
Reputation: 890
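It seems to me that this is more a question about how to construct a dfm with quanteda than about seededlda. If you build the dfm from a corpus that carries the user IDs as document names, the names are kept: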
library(quanteda)
dat <- data.frame(user = c("user1", "user2", "user3", "user4", "user5"),
                  post = c("a f", "b dd", "e g", "g a", "f b"))
dat
#    user post
# 1 user1  a f
# 2 user2 b dd
# 3 user3  e g
# 4 user4  g a
# 5 user5  f b
corp <- corpus(dat, docid_field = "user", text_field = "post")
# tokenize explicitly; newer quanteda versions expect tokens here
dfmt <- dfm(tokens(corp))
dfmt
# Document-feature matrix of: 5 documents, 6 features (66.7% sparse).
#        features
# docs    a f b dd e g
#   user1 1 1 0  0 0 0
#   user2 0 0 1  1 0 0
#   user3 0 0 0  0 1 1
#   user4 1 0 0  0 0 1
#   user5 0 1 1  0 0 0
As for seededlda, its topics() does not return a vector with document names, but you can assign the names yourself.
topic <- topics(slda)
names(topic) <- docnames(dfmt)
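Once the vector is named, it can be turned into a lookup table and joined back to the original data. Here is a minimal base-R sketch of that step. It assumes `topic` holds the result of `topics(slda)` in the same order as the rows of `dat`; the topic labels below are made-up stand-ins rather than real model output, and `user_topics` and `result` are names I picked for illustration.

```r
# Same toy data frame as above.
dat <- data.frame(user = c("user1", "user2", "user3", "user4", "user5"),
                  post = c("a f", "b dd", "e g", "g a", "f b"),
                  stringsAsFactors = FALSE)

# Hypothetical topic assignments standing in for topics(slda);
# a fitted model would return its own labels, one per document.
topic <- factor(c("topicA", "topicB", "other", "topicA", "topicB"))

# Name the vector by document; dat$user plays the role of docnames(dfmt) here.
names(topic) <- dat$user

# One row per user with the assigned topic.
user_topics <- data.frame(user = names(topic),
                          topic = as.character(topic),
                          stringsAsFactors = FALSE)

# Join back to the posts so each post sits next to its topic.
result <- merge(dat, user_topics, by = "user")
result
```

With a real model you would drop the stand-in `topic` line and use `topic <- topics(slda)` followed by `names(topic) <- docnames(dfmt)` as shown above.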
Upvotes: 0