Reputation: 107
I have recently run into trouble with bigrams while text mining in R. The goal is to find meaningful keywords in news text, for example "smart car" and "data mining".
Say I have the following string:
"IBM have a great success in the computer industry for the past decades..."
After removing the stopwords ("have", "a", "in", "the", "for"), it becomes:
"IBM great success computer industry past decades..."
As a result, bigrams like "success computer" or "industry past" appear.
But what I really need are bigrams with no stopword between the two words in the original text; "computer industry" is a clear example of what I want.
The relevant part of my code is below:
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
NgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))
Is there any way to keep bigrams like "success computer" out of the results when counting term frequencies?
Upvotes: 2
Views: 6198
Reputation: 14902
Note: Edited 2017-10-12 to reflect new quanteda syntax.
You can do this in quanteda, which can remove stop words from ngrams after they have been formed.
txt <- "IBM have a great success in the computer industry for the past decades..."
library("quanteda")
myDfm <- tokens(txt) %>%
    tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
    tokens_remove(stopwords("english"), padding = TRUE) %>%
    tokens_ngrams(n = 2) %>%
    dfm()
featnames(myDfm)
# [1] "great_success" "computer_industry" "past_decades"
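What it does: tokens() splits the text into tokens. The two tokens_remove() calls drop punctuation and then stop words, but because padding = TRUE each removed token leaves an empty placeholder in its position. tokens_ngrams(n = 2) then forms bigrams only from tokens that were adjacent in the original text, since no bigram is formed across a placeholder. Finally, dfm() turns the result into a document-feature matrix.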
To get a count of these bigrams, you can either inspect the dfm directly or use topfeatures():
myDfm
# Document-feature matrix of: 1 document, 3 features.
# 1 x 3 sparse Matrix of class "dfmSparse"
#        features
# docs    great_success computer_industry past_decades
#   text1             1                 1            1
topfeatures(myDfm)
#     great_success computer_industry      past_decades
#                 1                 1                 1
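If you want to see the padding idea without quanteda, here is a minimal base R sketch of the same approach. It is only an illustration under some assumptions: pad_bigrams() and the five-word stop_words vector are made-up names for this example (not part of tm or quanteda), the tokenizer handles ASCII letters and digits only, and stemming is left out. Stop words and punctuation are replaced with NA instead of being deleted, so a bigram is kept only when both words were neighbours in the original text.

```r
# Hypothetical, hand-picked stop word list for illustration only;
# in practice you would use a fuller list such as tm::stopwords("english").
stop_words <- c("have", "a", "in", "the", "for")

# pad_bigrams() is an illustrative helper, not a function from any package.
pad_bigrams <- function(text, stops) {
  lower <- tolower(text)
  # Extract runs of ASCII letters/digits (words) and runs of
  # other non-space characters (punctuation), in their original order.
  tokens <- regmatches(lower, gregexpr("[a-z0-9]+|[^a-z0-9[:space:]]+", lower))[[1]]
  if (length(tokens) < 2) return(character(0))
  # Replace punctuation and stop words with NA instead of deleting them,
  # so every remaining word keeps its original position.
  is_word <- grepl("^[a-z0-9]+$", tokens)
  tokens[!is_word | tokens %in% stops] <- NA
  # Pair every token with its right-hand neighbour.
  left  <- head(tokens, -1)
  right <- tail(tokens, -1)
  # Keep only pairs where neither side was padded out.
  keep <- !is.na(left) & !is.na(right)
  paste(left[keep], right[keep], sep = "_")
}

txt <- "IBM have a great success in the computer industry for the past decades..."
bigrams <- pad_bigrams(txt, stop_words)
print(bigrams)
# [1] "great_success"     "computer_industry" "past_decades"

# Term frequency of each bigram
print(table(bigrams))
```

Note that "ibm" produces no bigram here because its only neighbour is a stop word, which matches the quanteda result above.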
Upvotes: 3