Reputation: 485
First I create a document-term matrix as shown below:
dtm <- DocumentTermMatrix(docs)
Then I sum the occurrences of each word across all documents:
totalsums <- colSums(as.matrix(dtm))
The first 7 elements of totalsums (R reports its type as 'double') look like this:
aaab aabb aabc aacc abbb abbc abcc ...
   9    2   10    4    7    3   12 ...
I managed to sort this with the following command
sorted.sums <- sort(totalsums, decreasing = TRUE)
Now I want to extract the 4 terms/words with the highest sums, keeping only those whose sum is greater than 5.
I could get the first 4 highest with sorted.sums[1:4]
but how can I set a threshold value?
I managed to do this with the order function and findFreqTerms, as shown below, but is there a way to do it other than with the sort function and without using the findFreqTerms function?
ord.totalsums <- order(totalsums)
findFreqTerms(dtm, lowfreq=5)
Appreciate your thoughts on this.
Upvotes: 0
Views: 594
Reputation: 388982
You can use
sorted.sums[sorted.sums > 5][1:4]
But if you have at least 4 values that are greater than 5, sorted.sums[1:4] alone should work as well.
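One edge case worth knowing: if fewer than 4 values pass the threshold, indexing with [1:4] pads the result with NA. A minimal sketch of that case, using the seven sample values from the question and a stricter threshold of 9 chosen purely for illustration. head() is base R and returns at most the requested number of elements:

```r
# Sample sums copied from the question. The threshold of 9 is an assumption,
# picked only so that fewer than 4 values pass the filter.
totalsums <- c(aaab = 9, aabb = 2, aabc = 10, aacc = 4,
               abbb = 7, abbc = 3, abcc = 12)
sorted.sums <- sort(totalsums, decreasing = TRUE)

above <- sorted.sums[sorted.sums > 9]  # only abcc (12) and aabc (10) remain

padded  <- above[1:4]      # length 4: the last two entries are NA
trimmed <- head(above, 4)  # length 2: no NA padding
```

So head(x, 4) may be the safer choice when you cannot be sure how many values exceed the threshold.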
To get the words, you can use names:
names(sorted.sums[sorted.sums > 5][1:4])
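Putting it together on the sample values from the question, here is a rough sketch. The totalsums vector is typed in by hand rather than computed from a DocumentTermMatrix, and the variable names threshold, top.n, above and top.terms are made up for illustration. It filters before sorting, so only the values above the threshold get sorted:

```r
# Sample sums copied from the question (hand-typed, not built from a real dtm).
totalsums <- c(aaab = 9, aabb = 2, aabc = 10, aacc = 4,
               abbb = 7, abbc = 3, abcc = 12)

threshold <- 5  # keep only sums strictly greater than this
top.n     <- 4  # how many terms to return at most

# Filter first, then sort the survivors in decreasing order.
above <- totalsums[totalsums > threshold]

# head() returns at most top.n elements, so no NA appears if fewer qualify.
top.terms <- names(head(sort(above, decreasing = TRUE), top.n))

print(top.terms)  # "abcc" "aabc" "aaab" "abbb"
```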
Upvotes: 2