Mcguns
Mcguns

Reputation: 35

Text Mining in R: Counting 2-3 word phrases

I found a very useful piece of code within Stackoverflow - Finding 2 & 3 word Phrases Using R TM Package (credit @patrick perry) to show the frequency of 2 and 3 word phrases within a corpus:

library(corpus)
corpus <- gutenberg_corpus(55) # Project Gutenberg #55, _The Wizard of Oz_
text_filter(corpus)$drop_punct <- TRUE # ignore punctuation
term_stats(corpus, ngrams = 2:3)
##    term             count support
## 1  of the             336       1
## 2  the scarecrow      208       1
## 3  to the             185       1
## 4  and the            166       1
## 5  said the           152       1
## 6  in the             147       1
## 7  the lion           141       1
## 8  the tin            123       1
## 9  the tin woodman    114       1
## 10 tin woodman        114       1
## 11 i am                84       1
## 12 it was              69       1
## 13 in a                64       1
## 14 the great           63       1
## 15 the wicked          61       1
## 16 wicked witch        60       1
## 17 at the              59       1
## 18 the little          59       1
## 19 the wicked witch    58       1
## 20 back to             57       1
## ⋮  (52511 rows total)

How do you ensure that frequency counts of phrases like "the tin" are not also included in the frequency count of "the tin woodman" or the "tin woodman"?

Thanks

Upvotes: 0

Views: 433

Answers (1)

hello_friend
hello_friend

Reputation: 5788

Removing stopwords can remove noise from the data, causing issues such as those you are having a above:

library(tm)
library(corpus)
library(dplyr)
corpus <- Corpus(VectorSource(gutenberg_corpus(55)))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
term_stats(corpus, ngrams = 2:3) %>% 
  arrange(desc(count)) %>%
  group_by(grp = str_extract(as.character(term), "\\w+\\s+\\w+")) %>% 
  mutate(count_unique = ifelse(length(unique(count)) > 1, max(count) - min(count), count)) %>% 
  ungroup() %>% 
  select(-grp)

Upvotes: 1

Related Questions