darkpunk

Reputation: 17

Find Frequent Word and its Value in Document Term Frequency

So I have to find the most frequent word and its value from a DTM.

library('tm') 
library("SnowballC")  
my.text.location <- "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location))
class(apapers)
apapers <- tm_map(apapers, removeNumbers) 
apapers <- tm_map(apapers, removePunctuation) 
apapers <- tm_map(apapers, stemDocument, language ="en")

The code above cleans the corpus. The code below creates the DTM and finds the frequent terms.

ptm.tf <- DocumentTermMatrix(apapers) 
dim(ptm.tf)
findFreqTerms(ptm.tf)

Is there a way to get the frequent word and the frequency value together?

Upvotes: 0

Views: 1389

Answers (2)

phiver

Reputation: 23608

findFreqTerms does little more than compute row sums on a sparse matrix, using slam's row_sums internally. To keep the counts together with the words, we can call the same functions directly. The slam package is installed as a dependency of tm, so its functions are available if you load slam or call them via slam::. Using the slam functions is preferable because they operate on the sparse matrix directly. Base rowSums would require converting the sparse matrix into a dense matrix first, which is slower and uses a lot more memory.

# your code.....
ptm.tf <- DocumentTermMatrix(apapers) 

# using col_sums since it is a document term matrix. If it is a term document matrix use row_sums
frequency <- slam::col_sums(ptm.tf)
# Filtering like findFreqTerms. Find words that occur 10 times or more. 
frequency <- frequency[frequency >= 10]

# turn into data.frame if needed:
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
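
To pull out only the single most frequent word together with its count, which is what the question asks for, one option is to sort the named vector and take its first element. This is a minimal sketch that assumes a small in-memory corpus instead of your files. The names docs, toy_corpus and toy_dtm are made up for illustration and are not part of the original code.

```r
library(tm)

# Hypothetical toy documents standing in for the files in my.text.location
docs <- c("apple banana apple", "banana apple cherry", "apple cherry")
toy_corpus <- VCorpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# Total count per term across all documents (columns of a DTM are terms)
frequency <- slam::col_sums(toy_dtm)

# Sort so that the most frequent term comes first
frequency <- sort(frequency, decreasing = TRUE)

# The most frequent word and its count
top_word <- names(frequency)[1]
top_count <- frequency[[1]]
print(top_word)
print(top_count)

# The full ranking as a data.frame, words and counts side by side
frequency_df <- data.frame(words = names(frequency), freq = frequency, row.names = NULL)
print(frequency_df)
```

With your own data, replace toy_dtm with ptm.tf. If several words tie for first place, this only returns one of them, so compare against max(frequency) if you need all of them.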

Upvotes: 2

Jas

Reputation: 834

If you don't mind using another package, this should work (instead of creating a DTM object):

library('tm') 
library("SnowballC")  
my.text.location <- "C:/Users/mrina/OneDrive/Documents/../"
apapers <- VCorpus(DirSource(my.text.location)) 
class(apapers)
apapers <- tm_map(apapers, removeNumbers) 
apapers <- tm_map(apapers, removePunctuation) 
apapers <- tm_map(apapers, stemDocument, language ="en")

# new lines here
library(qdap)
freq_terms(apapers)

Created on 2018-09-28 by the reprex package (v0.2.0).

Upvotes: 0
