AnonymousCoward

Reputation: 31

Why does quanteda drop some documents when converting to topicmodels format?

I'm working with quanteda and finding that when I convert from a document-feature matrix to the topicmodels format, I lose some documents. Does anyone know why this happens or how I can prevent it? It is causing problems in a later section of my analysis. The code below runs from the construction of the dfm through to the conversion. When I run nrow(dfm_counts2), I get 199,560 rows, but after converting to dtm_lda there are only 198,435.

library(quanteda)

dfm_counts <- corpus_raw %>%
  dfm(tolower = TRUE, remove_punct = TRUE, remove_numbers = TRUE,
      remove = stopwords_and_single, stem = FALSE,
      remove_separators = TRUE, remove_url = TRUE, remove_symbols = TRUE)
docnames(dfm_counts) <- dfm_counts@docvars$index

## trim tokens too common or too rare to improve the efficiency of modeling
dfm_counts2 <- dfm_trim(dfm_counts, max_docfreq = 0.95, min_docfreq = 0.005,
                        docfreq_type = "prop")
dtm_lda <- convert(dfm_counts2, to = "topicmodels")

Upvotes: 1

Views: 171

Answers (1)

Ken Benoit

Reputation: 14902

That's because after your trimming, some documents now consist of zero features. convert(x, to = "topicmodels") removes empty documents, since you cannot fit a topic model on them, and topicmodels::LDA() produces an error if you try.
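If you want to handle this explicitly rather than relying on convert() to drop them silently, a minimal sketch (using your own dfm_counts2) is to subset out the empty documents yourself before converting, so the dfm and the resulting dtm stay aligned:

# keep only documents that still have at least one feature after trimming
dfm_counts2 <- dfm_subset(dfm_counts2, ntoken(dfm_counts2) > 0)
dtm_lda <- convert(dfm_counts2, to = "topicmodels")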

After the dfm_trim() call, 199560 - 198435 = 1125 documents must have consisted entirely of features that fall outside your docfreq range, leaving them empty.

I suspect that this will be true:

sum(ntoken(dfm_counts2) == 0) == 1125
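If the later section of your analysis needs to know exactly which documents were lost, their names can be recovered from the trimmed dfm before converting:

# document names of the now-empty documents
docnames(dfm_counts2)[ntoken(dfm_counts2) == 0]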

By the way, you can rename the documents with:

docnames(dfm_counts) <- dfm_counts$index

It's better to use this operator than to access the object's internals directly.

Upvotes: 1
