Form bigrams without stopwords in R

Question

I have recently run into trouble with bigrams while text mining in R. The goal is to find meaningful keywords in news text, for example "smart car" and "data mining".

Let's say I have the following string:

"IBM have a great success in the computer industry for the past decades..."

After removing the stopwords ("have", "a", "in", "the", "for"), it becomes:

"IBM great success computer industry past decades..."

As a result, bigrams like "success computer" or "industry past" will occur, even though those words were not adjacent in the original text.

What I really need are bigrams whose two words had no stopword between them in the original text; "computer industry" is a clear example of the kind of bigram I want.

Part of my code is below:

library(tm)  # also attaches NLP, which provides ngrams() and words()

corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, stemDocument)
# Join each pair of adjacent words into a single bigram string
NgramTokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
dtm <- TermDocumentMatrix(corpus, control = list(tokenize = NgramTokenizer))

Is there any way to keep bigrams like "success computer" out of the term-frequency counts?
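
One possible direction, offered only as a sketch: build the bigrams from the original word order first, and only afterwards discard every bigram that contains a stopword. The base-R example below illustrates the idea on the sentence from the question. The helper name bigrams_without_stopwords and the simple regex tokenizer are assumptions made for illustration, and the short stopword list is the one quoted above rather than the full stopwords("english") set.

```r
# Hypothetical helper (not part of tm): form bigrams from the original word
# order, then drop every bigram in which either word is a stopword.
bigrams_without_stopwords <- function(text, stop_words) {
  # Lowercase, then split on anything that is not a letter, digit or apostrophe.
  tokens <- unlist(strsplit(tolower(text), "[^a-z0-9']+"))
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < 2) return(character(0))

  first  <- tokens[-length(tokens)]  # words 1 .. n-1
  second <- tokens[-1]               # words 2 .. n

  # Keep a pair only when neither of its words is a stopword.
  keep <- !(first %in% stop_words) & !(second %in% stop_words)
  paste(first[keep], second[keep])
}

# Assumption: the small stopword list from the question;
# tm::stopwords("english") could be substituted.
stop_words <- c("have", "a", "in", "the", "for")
text <- "IBM have a great success in the computer industry for the past decades"

result <- bigrams_without_stopwords(text, stop_words)
print(result)
```

On this input the sketch keeps "great success", "computer industry" and "past decades", and never produces "success computer" or "industry past". To use the same idea with TermDocumentMatrix, the removeWords step would presumably have to be skipped before tokenizing, since the stopwords must still be present when the bigrams are formed; how this interacts with stemDocument has not been checked here.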
