Remove words from a dtm

Question

I have created a dtm.

library(tm)

corpus = Corpus(VectorSource(dat$Reviews))
dtm = DocumentTermMatrix(corpus)

I then removed the rare terms from it.

dtm = removeSparseTerms(dtm, 0.98)

After removeSparseTerms there are still some terms in the dtm which are useless for my analysis.

The tm package has a function to remove words, removeWords(). However, this function can only be applied to a corpus or a character vector, not to a dtm.

How can I remove defined terms from a dtm?

Here is a small sample of the input data:

library(dplyr)

samp = dat %>%
  select(Reviews) %>%
  sample_n(20)

dput(samp)
structure(list(Reviews = c("buenisimoooooo", "excelente", "excelent", 
"awesome phone awesome price almost month issue highly use blu manufacturer high speed processor blu iphone", 
"phone multiple failure poorly touch screen 2 slot sim card work responsible disappoint brand good team shop store wine money unfortunately precaution purchase", 
"work perfect time", "amaze buy phone smoothly update charm glte yet comparably fast several different provider sims perfectly small size definitely replacemnent simple", 
"phone work card non sim card description", "perfect reliable kinda fast even simple mobile sim digicel never problem far strongly anyone need nice expensive dual sim phone perfect gift love friend", 
"perfect", "great bang buck", "actually happy little sister really first good great picture late", 
"good phone good reception home fringe area screen lovely just right size good buy", 
"", "phone verizon contract phone buyer beware", "good phone", 
"excellent product total satisfaction", "dreadful phone home button never screen unresponsive answer call easily month phone test automatically emergency police round supplier network nothing never electricals amazon good buy locally refund", 
"good phone price fine", "phone star battery little soon yes"
)), row.names = c(12647L, 10088L, 14055L, 3720L, 6588L, 10626L, 
10362L, 1428L, 12580L, 5381L, 10431L, 2803L, 6644L, 12969L, 348L, 
10582L, 3215L, 13358L, 12708L, 7049L), class = "data.frame")

Ken Benoit · Accepted Answer

You should try quanteda, which calls a DocumentTermMatrix a "dfm" (document-feature matrix) and has more options for trimming it to reduce sparsity, including a function dfm_remove() for removing specific features (terms).

If we rename your samp object as dat, then:

library("quanteda")
## Package version: 1.4.3
## Parallel computing: 2 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- corpus(dat, text_field = "Reviews")
corp
## Corpus consisting of 20 documents and 0 docvars.
tail(texts(corp), 2)
##                                12708                                 7049 
##              "good phone price fine" "phone star battery little soon yes"

dtm <- dfm(corp)
dtm
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).

Now we can trim this. For a sample this small, the sparsity setting of 0.98 has no effect, but we can trim based on frequency thresholds instead.

# does not actually have an effect
dfm_trim(dtm, sparsity = 0.98, verbose = TRUE)
## Note: converting sparsity into min_docfreq = 1 - 0.98 = NULL .
## No features removed.
## Document-feature matrix of: 20 documents, 128 features (93.6% sparse).
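
To see why 0.98 does nothing here, a rough sketch of the arithmetic may help. It assumes the threshold works the way I understand tm's removeSparseTerms() to work: a term is kept when the number of documents containing it exceeds n_docs * (1 - sparse). The helper name survives_sparsity() and the document frequencies below are made up for illustration; they are not part of tm or quanteda.

```r
# Hypothetical helper (not part of tm or quanteda): TRUE for terms that a
# tm-style sparsity threshold would keep, given each term's document frequency.
survives_sparsity <- function(doc_freq, n_docs, sparse) {
  doc_freq > n_docs * (1 - sparse)
}

# Illustrative document frequencies, not taken from the real data
doc_freq <- c(phone = 9, screen = 3, excelente = 1)

# 20 documents: the cut-off is 20 * 0.02 = 0.4 documents, so every term stays
small <- survives_sparsity(doc_freq, n_docs = 20, sparse = 0.98)

# 1000 documents: the cut-off is about 20 documents, so all three are dropped
large <- survives_sparsity(doc_freq, n_docs = 1000, sparse = 0.98)
```

With only 20 documents, a term that appears in a single review already clears the cut-off, which is why nothing is removed above.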

# trimming based on rare terms
dtm <- dfm_trim(dtm, min_termfreq = 3, verbose = TRUE)
## Removing features occurring:
##   - fewer than 3 times: 119
##   Total features removed: 119 (93.0%).
head(dtm)
## Document-feature matrix of: 6 documents, 9 features (83.3% sparse).
## 6 x 9 sparse Matrix of class "dfm"
##        features
## docs    phone screen sim card work good perfect buy never
##   12647     0      0   0    0    0    0       0   0     0
##   10088     0      0   0    0    0    0       0   0     0
##   14055     0      0   0    0    0    0       0   0     0
##   3720      1      0   0    0    0    0       0   0     0
##   6588      1      1   1    1    1    1       0   0     0
##   10626     0      0   0    0    1    0       1   0     0

Anyway, to answer your question directly: you want dfm_remove() to get rid of specific features.

# removing from a specific list of terms
dtm <- dfm_remove(dtm, c("screen", "buy", "sim", "card"), verbose = TRUE)
## removed 4 features
## 

dtm
## Document-feature matrix of: 20 documents, 5 features (75.0% sparse).

head(dtm)
## Document-feature matrix of: 6 documents, 5 features (80.0% sparse).
## 6 x 5 sparse Matrix of class "dfm"
##        features
## docs    phone work good perfect never
##   12647     0    0    0       0     0
##   10088     0    0    0       0     0
##   14055     0    0    0       0     0
##   3720      1    0    0       0     0
##   6588      1    1    1       0     0
##   10626     0    1    0       1     0
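
The output above comes from quanteda 1.4.3. In more recent releases (3.x and later, as far as I am aware), dfm() expects a tokens object rather than a corpus, so the same removal might be sketched as follows. The three toy reviews are made-up stand-ins for the sampled data, and dfmat is just an illustrative object name.

```r
library(quanteda)

# Toy stand-in for the sampled reviews (hypothetical data)
dat <- data.frame(Reviews = c("good phone good screen",
                              "phone sim card work",
                              "perfect buy phone"),
                  stringsAsFactors = FALSE)

corp  <- corpus(dat, text_field = "Reviews")
toks  <- tokens(corp)   # tokenise first
dfmat <- dfm(toks)      # then build the document-feature matrix

# Drop a fixed list of features (terms)
dfmat <- dfm_remove(dfmat, pattern = c("screen", "buy", "sim", "card"))

featnames(dfmat)
```

By default the pattern is matched as a glob, so wildcards such as "excel*" should also work if you want to remove a family of related terms in one go.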

And finally, if you still really want to, you can convert the dtm into the tm format using quanteda's convert() function:

convert(dtm, to = "tm")
## <<DocumentTermMatrix (documents: 20, terms: 5)>>
## Non-/sparse entries: 25/75
## Sparsity           : 75%
## Maximal term length: 7
## Weighting          : term frequency (tf)
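
If you would rather stay in tm the whole time, a DocumentTermMatrix can also be subset by column like an ordinary matrix, so unwanted terms can be dropped directly. This is a minimal sketch under that assumption; the toy reviews and the names drop_terms and dtm_clean are hypothetical.

```r
library(tm)

# Toy reviews standing in for dat$Reviews (hypothetical data)
reviews <- c("good phone good screen",
             "phone sim card work",
             "perfect buy phone")

corpus <- Corpus(VectorSource(reviews))
dtm    <- DocumentTermMatrix(corpus)

# Terms to drop (illustrative list)
drop_terms <- c("screen", "buy", "sim", "card")

# Keep only the columns whose term is not in drop_terms
dtm_clean <- dtm[, !(Terms(dtm) %in% drop_terms)]

Terms(dtm_clean)
```

The result is still a DocumentTermMatrix, so it can be passed on to whatever tm-based code you already have.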
