Reputation: 2869
I've got a document-term matrix from the tm package in R.
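library(tm)
dd <- Corpus(VectorSource(train$text)) # Make a corpus object from a text vector
# Clean the text
dd <- tm_map(dd, stripWhitespace)
# Base functions such as tolower should be wrapped in content_transformer()
# (tm >= 0.6) so the documents keep their corpus structure
dd <- tm_map(dd, content_transformer(tolower))
dd <- tm_map(dd, removePunctuation)
dd <- tm_map(dd, removeWords, stopwords("english"))
dd <- tm_map(dd, stemDocument)
dd <- tm_map(dd, removeNumbers)
# Build the document-term matrix with tf-idf weights instead of raw counts
dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))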
I can't find a way to operate on the Document Term Matrix to extract the information I want: the top three keywords by tf-idf for each document. How do I get that?
EDIT: Example text (all from the Yelp Review Academic Data Set):
doc1 <- "Luckily, I didn't have to travel far to make my connecting flight. And for this, I thank you, Phoenix. My brief layover was pleasant as the employees were kind and the flight was on time. Hopefully, next time I can grace Phoenix with my presence for a little while longer."
doc2 <- "Nobuo shows his unique talents with everything on the menu. Carefully crafted features with much to drink. Start with the pork belly buns and a stout. Then go on until you can no longer."
doc3 <- "The oldish man who owns the store is as sweet as can be. Perhaps sweeter than the cookies or ice cream. Here's the lowdown: Giant ice cream cookie sandwiches for super cheap. The flavor permutations are basically endless. I had snickerdoodle with cookies and cream ice cream. It was marvelous."
I should mention that I have over 180,000 documents of this nature, so a solution that scales, rather than one that works solely on these specific examples, would be great.
Upvotes: 1
Views: 3623
Reputation: 109864
This works:
apply(dtm, 1, function(x) {
  x2 <- sort(x, decreasing = TRUE) # highest tf-idf first
  x2[x2 >= x2[3]]                  # keep every term tied with the third-highest score
})
## $doc1
## flight phoenix time
## 0.126797 0.126797 0.126797
##
## $doc2
## belli bun care craft drink everyth featur
## 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347
## menu much nobuo pork show start stout
## 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347 0.08805347
## talent uniqu
## 0.08805347 0.08805347
##
## $doc3
## cream cooki ice
## 0.2113283 0.1584963 0.1584963
If you want it to scale up, I'd use parallel computing.
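One thing to keep in mind at 180,000 documents: apply() converts the document-term matrix to a dense matrix before iterating, which may not fit in memory. A different approach (not the one above, just a sketch) is to work on the sparse triplet fields that the matrix already stores: dtm$i (document index), dtm$j (term index) and dtm$v (weight). The helper name top_terms_sparse and the toy input are made up for illustration, and unlike the apply() version it returns exactly n terms per document, breaking ties by term order rather than keeping all tied terms.

```r
# Hypothetical helper (not part of tm): top-n terms per document, computed
# from sparse triplets (i = document index, j = term index, v = weight).
top_terms_sparse <- function(i, j, v, terms, docs, n = 3) {
  # Order entries by document, then by descending weight within each document
  ord <- order(i, -v)
  i <- i[ord]; j <- j[ord]; v <- v[ord]
  # Rank of each entry within its document (1 = highest weight)
  rnk <- ave(seq_along(i), i, FUN = seq_along)
  keep <- rnk <= n
  # One named weight vector per document that has at least one term
  res <- split(setNames(v[keep], terms[j[keep]]), i[keep])
  names(res) <- docs[as.integer(names(res))]
  res
}

# Toy input standing in for dtm$i, dtm$j, dtm$v (assumed values, not real data)
i <- c(1, 1, 1, 1, 2, 2)
j <- c(1, 2, 3, 4, 1, 4)
v <- c(0.1, 0.4, 0.3, 0.2, 0.5, 0.6)
top <- top_terms_sparse(i, j, v,
                        terms = c("a", "b", "c", "d"),
                        docs = c("doc1", "doc2"))

# With the real matrix, the call would look something like:
# top <- top_terms_sparse(dtm$i, dtm$j, dtm$v, Terms(dtm), Docs(dtm))
```

Documents with no remaining terms after cleaning simply do not appear in the result, so check names(top) against Docs(dtm) if you need one entry per document.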
Upvotes: 2