user1329307

Reputation: 161

Removing Stop Phrases from DocumentTermMatrix

Below, I fit a basic topic model to the "crude" data. I know I can remove stop words using tm_map, but I can't figure out how to remove them after the bigram tokenization has run.

library(topicmodels)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(tidytext)

data("crude")
words <- tm_map(crude, content_transformer(tolower))
words <- tm_map(words, removePunctuation)
words <- tm_map(words, stripWhitespace)

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

# unigram + bigram tokenization (min = 1, max = 2)
dtm <- DocumentTermMatrix(words, control = list(tokenize = BigramTokenizer))
ui <- unique(dtm$i)  # row indices of documents with at least one term
dtm <- dtm[ui, ]     # remove "empty" documents

lda <- LDA(dtm, k = 2,control = list(seed = 7272))

topics <- tidy(lda, matrix = "beta")

##Graphs
top_terms <- topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)

top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

#single
stopwords1<- stopwords("english") ##I actually use a custom list: read.csv("stopwords.txt", header = FALSE)
adnlstopwords1<-c("ny","new","york","yorks","state","nyc","nys")

#doubles
stopwords2<-levels(interaction(stopwords1,stopwords1,sep=' '))
adnlstopwords2<-c(stopwords2,c("new york", "york state", "in ny", "in new",
                  "new yorks"))

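# combine the lists; adnlstopwords2 already contains stopwords2
# (the original line concatenated the tm function `stopwords` instead of stopwords1)
stopwords <- unique(c(stopwords1, adnlstopwords1, adnlstopwords2))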

My question is how to remove these bigrams from the dtm without using tm_map, or what work-around there might be. Note that the "new york" based bigrams might not occur in the crude data, but they are important for my other data.

Upvotes: 0

Views: 777

Answers (1)

user1329307

Reputation: 161

I came across this solution from the "gofastr" package in R:

dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

However, I still saw stop phrases in the results. After reviewing the documentation, I found that remove_stopwords assumes it is given a sorted list. You can prepare your stop words/phrases with the prep_stopwords() function from the same package:

stopwords<-prep_stopwords(stopwords)
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

To combine this with stemming, perform the stemming in the tm_map part of the code and then remove the stop words as follows:

stopwords<-prep_stopwords(stemDocument(stopwords))
dtm2 <- remove_stopwords(dtm, stopwords = stopwords)

This stems the stop words so that they match the already stemmed terms in the dtm.
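As a possible work-around that needs nothing beyond tm itself, the stop phrases can also be dropped by subsetting the columns of the dtm directly. This is only a minimal sketch, not the gofastr method above: the toy documents, the stop_phrases vector, and the ngrams()-based tokenizer (swapped in for the RWeka one so the example runs without Java) are all assumptions made for illustration.

```r
library(tm)  # also attaches NLP, which provides words() and ngrams()

# Toy corpus standing in for the real data (hypothetical documents)
docs <- VCorpus(VectorSource(c("oil prices in new york rose",
                               "new york state oil output fell")))

# Unigram + bigram tokenizer; stands in for the RWeka-based BigramTokenizer
BigramTokenizer <- function(x) {
  w <- words(x)
  c(w, unlist(lapply(ngrams(w, 2), paste, collapse = " "), use.names = FALSE))
}

dtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))

# Hypothetical stop list mixing single words and phrases
stop_phrases <- c("new", "york", "new york", "york state", "in new")

# Keep only the columns whose term is not in the stop list
dtm2 <- dtm[, !(Terms(dtm) %in% stop_phrases)]

# Dropping columns can leave documents with no terms, which LDA() rejects
dtm2 <- dtm2[slam::row_sums(dtm2) > 0, ]
```

With the real data, stop_phrases would be the combined stopwords vector from the question. If the documents were stemmed in tm_map, the stop list should be stemmed the same way before matching.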

Upvotes: 1
