Ihda

Reputation: 111

How to calculate document frequency in R?

I have a data frame called "pertanian":

DOCS <- c(1:5)
TEXT <- c("tanaman jagung seumur jagung " , 
          "tanaman jagung kacang ketimun rusak dimakan kelinci" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" , 
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")
pertanian <- data.frame(DOCS , TEXT)

From this data frame, I built a term-document matrix like this:

term     DOCS 1  DOCS 2  DOCS 3  DOCS 4  DOCS 5
dimakan    0       1       1       1       0
diserbu    0       0       1       1       1
jagung     2       1       1       1       1
kacang     0       1       1       1       1
kelinci    0       1       1       1       1
ketimun    0       1       1       1       1
ladang     0       0       1       1       1
rusak      0       1       1       1       1
seumur     1       0       0       0       0
tanaman    1       1       1       1       1

From the term-document matrix above, I want to compute the document frequency of each term, like this:

Term        DF
dimakan     3 
diserbu     3
jagung      5
kacang      4
kelinci     4
ketimun     4
ladang      3
rusak       4
seumur      1
tanaman     5

I have tried this code:

library(tm)

# Build a corpus and a term-document matrix from the TEXT column
myCorpus <- Corpus(VectorSource(pertanian$TEXT))
myCorpus2 <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus2)
temp <- inspect(tdm)
colnames(temp) <- paste("DOCS", pertanian$DOCS)
# Sum each row of counts (one row per term)
Doc.Freq <- data.frame(apply(temp, 1, sum))
# Rename the columns
Doc.Freq <- cbind(Term = rownames(Doc.Freq), Doc.Freq)
row.names(Doc.Freq) <- NULL
names(Doc.Freq)[names(Doc.Freq) == "apply.temp..1..sum."] <- "DF"

However, the result is the term frequency, not the document frequency: the term 'jagung' is counted as 6, but its document frequency should be 5.

Upvotes: 3

Views: 683

Answers (2)

Steven Beaupré

Reputation: 21641

Something like this?

Note: Here I assume that your desired output has an error and that kacang is present in 4 of the 5 docs.

library(tm)
library(dplyr)

v <- Corpus(VectorSource(TEXT))

data.frame(inspect(TermDocumentMatrix(v))) %>%
  add_rownames() %>%
  mutate(DF = rowSums(.[-1] >= 1)) %>%
  select(Term = rowname, DF)

Which gives:

#Source: local data frame [10 x 2]
#
#      Term DF
#1  dimakan  3
#2  diserbu  3
#3   jagung  5
#4   kacang  4
#5  kelinci  4
#6  ketimun  4
#7   ladang  3
#8    rusak  4
#9   seumur  1
#10 tanaman  5

Or you could simply do:

transform(rowSums(inspect(TermDocumentMatrix(v)) >= 1))
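
To illustrate why the `>= 1` comparison matters, here is a minimal base R sketch that does not need `tm`. The matrix `tdm_counts` is a hand-typed copy of three rows from the question's term-document matrix, and the names `tdm_counts`, `term_freq` and `doc_freq` are made up for this example. Comparing the counts with `>= 1` turns them into TRUE/FALSE presence flags, so `rowSums()` counts documents rather than occurrences:

```r
# Hand-typed copy of three rows of the question's term-document counts
# (an assumption for illustration; not produced by tm here)
tdm_counts <- matrix(
  c(2, 1, 1, 1, 1,   # jagung
    0, 1, 1, 1, 1,   # kacang
    1, 0, 0, 0, 0),  # seumur
  nrow = 3, byrow = TRUE,
  dimnames = list(c("jagung", "kacang", "seumur"), paste("DOCS", 1:5))
)

# Term frequency: total number of occurrences across all documents
term_freq <- rowSums(tdm_counts)

# Document frequency: number of documents with at least one occurrence
doc_freq <- rowSums(tdm_counts >= 1)

data.frame(Term = names(doc_freq),
           TF = unname(term_freq),
           DF = unname(doc_freq))
```

For 'jagung' this gives TF = 6 and DF = 5, which is the difference described in the question.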

Upvotes: 5

Shenglin Chen

Reputation: 4554

Try this:

dd <- strsplit(as.character(TEXT), ' ')

transform(table(unlist(lapply(dd, unique))))
#      Var1 Freq
#1  dimakan    3
#2  diserbu    3
#3   jagung    5
#4   kacang    4
#5  kelinci    4
#6  ketimun    4
#7   ladang    3
#8    rusak    4
#9   seumur    1
#10 tanaman    5
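
A possible variant of the same idea, sketched under a few assumptions: it splits on any run of whitespace (`"\\s+"`) after `trimws()`, so leading, trailing or doubled spaces should not produce empty tokens, and it names the columns `Term` and `DF` as in the question. The names `tokens` and `df_counts` are made up for this example, and `TEXT` is repeated from the question so the snippet runs on its own:

```r
# TEXT copied from the question so the snippet is self-contained
TEXT <- c("tanaman jagung seumur jagung ",
          "tanaman jagung kacang ketimun rusak dimakan kelinci",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")

# Split each document into words on any run of whitespace
tokens <- strsplit(trimws(TEXT), "\\s+")

# Keep each word once per document, then count in how many documents it occurs
df_counts <- table(unlist(lapply(tokens, unique)))

# Assemble the result with the column names used in the question
Doc.Freq <- data.frame(Term = names(df_counts),
                       DF = as.vector(df_counts))
Doc.Freq
```

The `unique()` call per document is what makes this a document frequency: 'jagung' appears twice in the first document but is counted only once there.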

Upvotes: 1
