Reputation: 185
I have a large dataset with almost 90 columns and about 200k observations. One of the columns contains descriptions, so it is text only. However, about 100 of the descriptions are NA.
I tried Pablo Barbera's topic model code from GitHub, because I need to fit a topic model to these descriptions.
library(topicmodels)
library(quanteda)
des <- subset(finalMSI, !is.na(description), select=c(description))
corpus_des <- corpus(des$description)
df_des <- dfm(corpus_des, remove=stopwords("spanish"), verbose=TRUE,
remove_punct=TRUE, remove_numbers=TRUE)
cdes <- dfm_trim(df_des, min_docfreq = 2)
# estimate LDA with K topics
K <- 20
lda <- LDA(cdes, k = K, method = "Gibbs",
control = list(verbose=25L, seed = 123, burnin = 100, iter = 500))
Error in LDA(cdes, k = K, method = "Gibbs", control = list(verbose = 25L, : Each row of the input matrix needs to contain at least one non-zero entry
Since I don't have any NAs in my subset, I don't understand this error message (it's my first time using this package).
Upvotes: 0
Views: 211
Reputation: 14912
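It looks like some of your documents are empty, in the sense that they contain no counts of any feature. A description does not have to be NA for this to happen: once stopwords, punctuation, and numbers are removed and dfm_trim() drops features that appear in fewer than two documents, a short description can be left with zero tokens.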
You can remove them with:
# trim rare features first, then drop documents left with no tokens
cdes <- dfm_trim(df_des, min_docfreq = 2)
cdes <- dfm_subset(cdes, ntoken(cdes) > 0)
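To see how a non-NA document can end up empty, here is a minimal sketch on a made-up three-document corpus (the texts and the object names txt, toks, m and m_nonempty are hypothetical, not from your data). It assumes a recent quanteda release, where text is tokenized with tokens() before calling dfm(); the arguments may need adjusting for older versions.

```r
library(quanteda)

# Toy data (hypothetical): the second text has only a stopword, a number, and punctuation
txt <- c(d1 = "el gato come pescado",
         d2 = "y 123 !!!",
         d3 = "el gato duerme")

# Tokenize, dropping punctuation and numbers, then remove Spanish stopwords
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("spanish"))

# Build the document-feature matrix and keep features found in at least 2 documents
m <- dfm(toks)
m <- dfm_trim(m, min_docfreq = 2)

# Token count per document; a zero here is what triggers the LDA() error
print(ntoken(m))

# Keep only documents that still have at least one token
m_nonempty <- dfm_subset(m, ntoken(m) > 0)
print(docnames(m_nonempty))
```

Running sum(ntoken(cdes) == 0) on your own matrix before calling LDA() should tell you how many descriptions were emptied by the preprocessing.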
Upvotes: 1