Reputation: 187
The stm package provides searchK() to help find the optimal number of topics (K) for a topic model, using a process like this:
library(stm)
library(quanteda)
library(ggplot2)
temp <- textProcessor(documents = gadarian$open.ended.response, metadata = gadarian)
out <- prepDocuments(temp$documents, temp$vocab, temp$meta)
documents <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
K <- c(5, 10, 15)
df1 <- searchK(documents, vocab, K, data=meta)
The textProcessor() and prepDocuments() steps in this example apply specific preprocessing (stemming, etc.). How can I change this preprocessing and instead use the following dfm as the input to searchK()?
myDfm <- gadarian$open.ended.response %>%
    tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
    dfm()
Upvotes: 1
Views: 483
Reputation: 14902
Use the convert(x, to = "stm") function from quanteda to get the list that searchK() needs. So add this:
out <- convert(myDfm, to = "stm")
Then, the same code from above will work:
documents <- out$documents
vocab <- out$vocab
meta <- out$meta
set.seed(02138)
K <- c(5, 10, 15)
df1 <- searchK(documents, vocab, K, data = meta)
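Note that a dfm built from a bare character vector carries no document variables, so out$meta will hold no covariates. If you also want the gadarian metadata available to searchK(), one possible approach is to build a quanteda corpus first, so the other columns travel along as docvars. The following is a minimal sketch of that idea, not the only way to do it; the names dat, corp and toks are made up for illustration, and the as.character() call is a precaution in case the text column is not already a character vector.

```r
library(stm)       # provides the gadarian data and searchK()
library(quanteda)  # provides corpus(), tokens(), dfm(), convert()

# Work on a copy and make sure the text column is character (precaution).
dat <- gadarian
dat$open.ended.response <- as.character(dat$open.ended.response)

# Build a corpus; the remaining columns of dat become docvars.
corp <- corpus(dat, text_field = "open.ended.response")

# Custom preprocessing: no stemming, only punctuation/number/symbol removal.
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE,
               remove_symbols = TRUE)
myDfm <- dfm(toks)

# Convert to the list format stm expects; the docvars end up in out$meta.
out <- convert(myDfm, to = "stm")

set.seed(02138)
K <- c(5, 10, 15)
df1 <- searchK(out$documents, out$vocab, K, data = out$meta)
```

Calling plot(df1) afterwards draws the diagnostic plots that stm provides for searchK() results.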
Upvotes: 2