What causes 'subscript out of bounds' error in STM topic modeling with missing data?
Please have a look at the self-contained example at the end of the post. I simplified the reprex and you can download the dfm (document-feature matrix) from
A couple of things happen that I do not understand
but here I give a reproducible example.
Any help for 1) and 2) is appreciated!
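## Packages used in this example
library(quanteda)  # convert(), dfm_sample()
library(stm)       # stm(), searchK()
library(tidytext)  # tidy() for stm models, reorder_within(), scale_x_reordered()
library(dplyr)     # group_by(), top_n(), ungroup(), mutate()
library(ggplot2)   # plotting
## Download the dfm matrix from
dfm_mat <- readRDS("dfm_mat.RDS")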
## see
## convert the dfm to a format suitable to stm.
dfm2stm <- convert(dfm_mat, to = "stm")
model.stm <- stm(dfm2stm$documents, dfm2stm$vocab, K = 9, data = dfm2stm$meta,
                 init.type = "Spectral")
## I make the model tidy.
## See
stm_tidy <- tidy(model.stm)
gpl <- stm_tidy |>
  group_by(topic) |>
  top_n(10, beta) |>
  ungroup() |>
  mutate(topic = paste0("Topic ", topic),
         term = reorder_within(term, beta, topic)) |>
  ggplot(aes(term, beta, fill = as.factor(topic))) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free_y") +
  coord_flip() +
  scale_x_reordered() +
  labs(x = NULL, y = expression(beta),
       title = "Highest word probabilities for each topic",
       subtitle = "Different words are associated with different topics")
## I can fit a model by stm with a chosen number of topics to the data
### Now I try determining the optimal number of topics using the searchK function
### See
K <- 5:15
model_search <- searchK(dfm2stm$documents, dfm2stm$vocab, K,
                        data = dfm2stm$meta)
#> Error in missing$docs[[i]]: subscript out of bounds
## This fails, but I do not understand why.
I think what is happening is this: with only three documents in your dfm_mat, searchK() is trying by default to drop half of them to use for a held-out set. This causes many features to be zero, which means they are dropped from the vocab by default when estimating the topic models used in searchK(). stm() needs only non-zero features, but searchK() considers the vocab set to be fixed, so it's breaking some code inside the function. (I did not check this in the code, however.)
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 603
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 583
> sum(colSums(dfm_sample(dfm_mat, size = 2)) == 0)
[1] 582
These are the three sample options for dropping 1 of the 3 documents (0.50 rounded up).
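To see the same mechanism on something small, here is a toy illustration in base R. The matrix `toy` is made up for this example (it is not your dfm_mat) and only stands in for a document-feature matrix with three documents; the point is that removing any one document leaves some features with no counts at all.

```r
# Hypothetical 3-document x 5-feature count matrix (NOT the real dfm_mat).
toy <- matrix(
  c(1, 0, 0, 2, 0,
    0, 3, 0, 1, 0,
    0, 0, 4, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("feat", 1:5))
)

# For each document, count the features that become all-zero
# once that document is removed from the matrix.
zero_after_drop <- sapply(seq_len(nrow(toy)), function(i) {
  sum(colSums(toy[-i, , drop = FALSE]) == 0)
})
names(zero_after_drop) <- rownames(toy)
zero_after_drop
```

Features that occur in only one document are the ones that vanish, which is why a very small corpus is so sensitive to holding anything out.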
You would need to contact the stm package maintainers about a potential bug report. Alternatively, for your problem, use more documents and trim features with low frequencies (for example with quanteda's dfm_trim()).
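On the "use more documents" point, a rough sanity check may help. As far as I recall from the stm help pages, searchK() has an argument N for the number of documents to partially hold out, with a default of floor(0.1 * length(documents)). Treat that formula as an assumption and confirm it with ?searchK for your installed version. The sketch below only evaluates that expression in base R for a few corpus sizes; `default_heldout_n` is a helper name I made up for illustration.

```r
# Assumed default for searchK()'s N argument (please check ?searchK):
# floor(0.1 * number of documents).
# default_heldout_n() is a hypothetical helper, not part of stm.
default_heldout_n <- function(n_docs) floor(0.1 * n_docs)

# Compare a few corpus sizes, including the 3 documents in the question.
n_docs <- c(3, 9, 10, 50, 200)
heldout_table <- data.frame(
  n_docs = n_docs,
  heldout_docs = default_heldout_n(n_docs)
)
heldout_table
```

If that default is right, three documents give zero held-out documents, which could be another route to the error. Under the same assumption you would need at least 10 documents before a single document is held out.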
