Find Frequent Word and its Value in Document Term Frequency

Question

So I have to find the most frequent word and its value from a DTM.

library('tm') 
library("SnowballC")  
my.text.location <- "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers) 
apapers <- tm_map(apapers, removePunctuation) 
apapers <- tm_map(apapers, stemDocument, language ="en")

The code above cleans the corpus; the code below creates the DTM and finds the frequent terms.

ptm.tf <- DocumentTermMatrix(apapers) 
dim(ptm.tf)
findFreqTerms(ptm.tf)

Is there a way to get the frequent word and the frequency value together?

phiver · Accepted Answer

findFreqTerms does little more than take row sums of a sparse matrix, using slam's row_sums. To keep the counts with the words, we can use the same functions. The slam package is installed along with tm, so its functions are available if you load slam or call them via slam::. The slam functions are preferable because they work directly on sparse matrices. Base rowSums would first convert the sparse matrix into a dense matrix, which is slower and uses a lot more memory.

# your code.....
ptm.tf <- DocumentTermMatrix(apapers) 

# using col_sums since it is a document term matrix. If it is a term document matrix use row_sums
frequency <- slam::col_sums(ptm.tf)
# Filtering like findFreqTerms. Find words that occur 10 times or more. 
frequency <- frequency[frequency >= 10]

# turn into data.frame if needed:
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
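
Since the question asks for the single most frequent word, one way to get it is to sort the column sums before building the data.frame. The following is only a minimal sketch on a toy corpus: the documents and the names docs, toy_corpus, toy_dtm, term_totals, top_term and freq_table are made up for illustration and are not part of the original code.

```r
library(tm)

# Hypothetical toy corpus standing in for the asker's files.
docs <- c("apple banana apple", "banana apple cherry", "apple cherry")
toy_corpus <- VCorpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Total count per term across all documents, computed on the sparse matrix.
term_totals <- slam::col_sums(toy_dtm)

# Sort in decreasing order so the most frequent term comes first.
term_totals <- sort(term_totals, decreasing = TRUE)

# The single most frequent word together with its count.
top_term <- data.frame(word = names(term_totals)[1],
                       freq = unname(term_totals[1]))

# The full word/count table, most frequent first.
freq_table <- data.frame(word = names(term_totals),
                         freq = unname(term_totals),
                         row.names = NULL)
```

With a real corpus, replace toy_dtm with ptm.tf. If several words tie for the highest count, only the first one after sorting ends up in top_term, so check freq_table when ties matter.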
