Reputation: 31
I'm working with quanteda and finding that when I convert a document-feature matrix to a topic models object, I lose some documents. Does anyone know why this happens or how I can prevent it? It is causing problems in a later section of my analysis. The code below runs from the construction of the dfm through the conversion. When I run nrow(dfm_counts2) I get 199,560 rows, but after converting to dtm_lda there are only 198,435.
dfm_counts <- corpus_raw %>%
  dfm(tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
      remove = stopwords_and_single, stem = FALSE,
      remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE)

## use each document's index as its document name
docnames(dfm_counts) <- dfm_counts@docvars$index

## trim tokens too common or too rare to improve efficiency of modeling
dfm_counts2 <- dfm_trim(dfm_counts, max_docfreq = 0.95, min_docfreq = 0.005,
                        docfreq_type = "prop")

dtm_lda <- convert(dfm_counts2, to = "topicmodels")
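For reference, this is the check producing those numbers (counts are from my run; nrow() also works on the converted DocumentTermMatrix):

nrow(dfm_counts2)  # 199560
nrow(dtm_lda)      # 198435 -- 1,125 documents disappear in the conversion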
Upvotes: 1
Views: 171
Reputation: 14902
That's because after your trimming, some documents now consist of zero features. convert(x, to = "topicmodels") removes empty documents, since you cannot fit them in a topic model, and topicmodels::LDA() produces an error if you try. Given your dfm_trim() call, 199560 - 198435 = 1125 documents must have consisted only of features that fall outside your docfreq range.
I suspect that this will be true:
sum(ntoken(dfm_counts2) == 0) == 1125
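If you want the dfm and the converted object to stay aligned (for instance, to join docvars back to the topic model output later), you can drop the empty documents from the dfm itself before converting. A minimal sketch, assuming a recent quanteda where dfm_subset() and ntoken() are available:

## remove documents left with zero features after trimming, so the dfm
## and the DocumentTermMatrix contain exactly the same documents
dfm_counts2 <- dfm_subset(dfm_counts2, ntoken(dfm_counts2) > 0)
dtm_lda <- convert(dfm_counts2, to = "topicmodels")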
By the way, you can rename the document names with:

docnames(dfm_counts) <- dfm_counts$index

It's better to use this accessor than to reach into the object internals directly.
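A quick way to confirm the accessor returns the same vector as the slot, assuming quanteda >= 2 where docvars live in the @docvars slot:

## should be TRUE: $ retrieves the docvar without touching internals
identical(dfm_counts$index, dfm_counts@docvars$index)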
Upvotes: 1