Heather_B

Reputation: 11

How to use the seededlda package in R to retain identification of users for topics

I have been trying to do topic modeling on a collection of discussion forum posts in a MOOC. I have tried basic LDA to create topics, and the topics were meaningless. So now I'm looking into seeding my topics to create better topics. I found the seededlda package, which requires a dfm as an input as well as a dictionary of seeded terms. It works well! My issue is figuring out how each document, or forum post, is categorized.

My original data has "userid" as a variable and "post" as the document I'm using for LDA. So far my code looks like this.

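library(quanteda)
text <- introduction_posts$post
# tokenize first: recent quanteda versions no longer accept raw text in dfm()
toks <- tokens(text, remove_numbers = TRUE)
dfmt <- dfm_remove(dfm(toks), stopwords('en'), min_nchar = 2)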
#install.packages("seededlda")
library(seededlda)
slda <- textmodel_seededlda(dfmt,
  seeded_dict,
  valuetype = "glob",  # pick one matching type; "glob" is the default
  case_insensitive = FALSE,
  residual = TRUE,
  weight = 0.01,
  max_iter = 2000,
  alpha = NULL,
  beta = NULL,
  verbose = quanteda_options("verbose")
)
terms <- terms(slda)

How can I determine which topic (and its terms) goes with which user?

When I used the LDA function from the topicmodeling package, I used a document-term matrix defined this way (CreateDtm is from textmineR):

posts_dtm <- CreateDtm(doc_vec = introduction_posts$post, # character vector of documents
                 doc_names = introduction_posts$userid_bycourse, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # Snowball English stopwords
                                  stopwords::stopwords(source = "smart"))) # SMART stopwords

which named the documents as it went along. In the end I was able to nicely see which topics went to which participants. But I can't seem to do that with the dfm that the seededlda package uses.

Any help would be appreciated.

Upvotes: 0

Views: 446

Answers (1)

Kohei Watanabe

Reputation: 890

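It seems to me that this is more about how to construct the dfm with quanteda than about seededlda. If you build a corpus with the user ID as the document name, the dfm keeps those names as its docnames.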

dat <- data.frame(user = c("user1", "user2", "user3", "user4", "user5"),
                  post = c("a f", "b dd", "e g", "g a", "f b"))
dat
#    user post
# 1 user1  a f
# 2 user2 b dd
# 3 user3  e g
# 4 user4  g a
# 5 user5  f b

library(quanteda)
corp <- corpus(dat, docid_field = "user", text_field = "post")
dfmt <- dfm(tokens(corp))  # recent quanteda versions require tokens() before dfm()
dfmt
# Document-feature matrix of: 5 documents, 6 features (66.7% sparse).
#        features
# docs    a f b dd e g
#   user1 1 1 0  0 0 0
#   user2 0 0 1  1 0 0
#   user3 0 0 0  0 1 1
#   user4 1 0 0  0 0 1
#   user5 0 1 1  0 0 0

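As for seededlda, its topics() does not return a vector with document names, but you can assign the names yourself. Take them from the same dfm that was used to fit the model, so that the order of the topics matches the order of the documents.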

topic <- topics(slda)
names(topic) <- docnames(dfmt)
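
Once the topics are lined up with the document names, a small table of users and topics may be easier to read than a named vector. Here is a minimal base R sketch of that step. The `users` and `topic` vectors below are made-up stand-ins so the snippet runs on its own; in practice they would be docnames(dfmt) and topics(slda), and the topic labels would come from your seeded dictionary.

```r
# Stand-ins for illustration only (assumed values, not real model output):
# in real use, users <- docnames(dfmt) and topic <- topics(slda).
users <- c("user1", "user2", "user3", "user4", "user5")
topic <- factor(c("topicA", "topicB", "topicA", "other", "topicB"))

# One row per document: the user ID next to its assigned topic.
# This relies on both vectors being in the same document order.
user_topics <- data.frame(user = users, topic = topic, stringsAsFactors = FALSE)

# List the users that fall under each topic.
users_by_topic <- split(user_topics$user, user_topics$topic)

print(user_topics)
print(users_by_topic)
```

From there, user_topics can be merged back onto the original data frame by user ID if you need the other columns as well.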

Upvotes: 0
