Erik Bodg

Reputation: 282

Apply dfm-style preprocessing directly to a text column without creating the dfm

I have a data frame like this:

# id and text must have the same length, so the stray trailing "" element is dropped
dataf <- data.frame(
  id = c(1, 2, 3, 4),
  text = c(
    "Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s",
    "Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now",
    "There are many variations of passages of Lorem Ipsum available, but the majority have suffered alteration in some form, by injected humour",
    "a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum"
  )
)

Text preprocessing can be done while constructing a dfm:

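# myCorpus (a quanteda corpus) and mystopwords (a character vector) are
# assumed to exist already; neither is defined in this question
myDfm <- myCorpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
  tokens_remove(pattern = c(stopwords(source = "smart"), mystopwords)) %>%
  tokens_wordstem() %>%
  dfm(verbose = FALSE) %>%
  dfm_trim(min_docfreq = 3, min_termfreq = 5)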

Is there an alternative way to remove the stopwords (stopwords(source = "smart")), apply word stemming, and trim with min_docfreq = 3, min_termfreq = 5 directly on the text column, without needing to create the dfm?

Upvotes: 0

Views: 164

Answers (1)

Ken Benoit

Reputation: 14902

I'll answer this based on the question plus comment, since it seems you need a dgCMatrix class for what you want to do. (This is what is returned by textmineR::CreateDtm().) Fortunately, a quanteda dfm is a special type of dgCMatrix already. So it would probably work as is, but if you want, it's also easy to convert -- just use as().

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.
data(nih_sample, package = "textmineR")

dfmat <- nih_sample %>%
  corpus(text_field = "ABSTRACT_TEXT", docid_field = "APPLICATION_ID") %>%
  tokens() %>%
  tokens_ngrams(n = 1:2) %>%
  dfm()
dtm2 <- as(dfmat, "dgCMatrix")

Now, dtm2 should work the same as the dtm in the blog post. (The features/columns are in a different order, but that should not matter for a matrix that will be input to a topic model.) And: it's a WHOLE lot cleaner as a process.

Feel free to insert additional tokens() options, dfm_trim(), etc. from quanteda as you need.
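For instance, here is a minimal sketch of the preprocessing steps from the question feeding into the same as() conversion. The toy texts and the object names (txt, toks, dfmat_small, dtm_small) are made up for illustration, mystopwords is left out because it is not defined in the question, and the trimming thresholds are lowered so that some features survive on three short documents:

```r
library("quanteda")

# Hypothetical toy documents, not taken from the question or the blog post
txt <- c(d1 = "The printers were printing printed pages.",
         d2 = "Printing pages is what the printer does.",
         d3 = "Pages and pages of printed text.")

# Tokenise, then drop stopwords and stem, as in the question's pipeline
toks <- tokens(txt, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE)
toks <- tokens_remove(toks, pattern = stopwords(source = "smart"))
toks <- tokens_wordstem(toks)

# Build the dfm and trim rare features (thresholds lowered for the toy data)
dfmat_small <- dfm(toks)
dfmat_small <- dfm_trim(dfmat_small, min_termfreq = 2, min_docfreq = 2)

# Convert to a plain sparse matrix for packages that expect a dgCMatrix
dtm_small <- as(dfmat_small, "dgCMatrix")
```

With your real data you would put back min_docfreq = 3 and min_termfreq = 5 and add your own stopword vector to the pattern argument.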

Upvotes: 3
