Reputation: 111
I have a data frame called "pertanian":
DOCS <- c(1:5)
TEXT <- c("tanaman jagung seumur jagung " ,
"tanaman jagung kacang ketimun rusak dimakan kelinci" ,
"ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" ,
"ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan" ,
"ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")
pertanian <- data.frame(DOCS , TEXT)
From this data frame, I built a term-document matrix like this:
term DOCS 1 DOCS 2 DOCS 3 DOCS 4 DOCS 5
dimakan 0 1 1 1 0
diserbu 0 0 1 1 1
jagung 2 1 1 1 1
kacang 0 1 1 1 1
kelinci 0 1 1 1 1
ketimun 0 1 1 1 1
ladang 0 0 1 1 1
rusak 0 1 1 1 1
seumur 1 0 0 0 0
tanaman 1 1 1 1 1
From the term-document matrix above, I want to compute the document frequency of each term, like this:
Term DF
dimakan 3
diserbu 3
jagung 5
kacang 4
kelinci 4
ketimun 4
ladang 3
rusak 4
seumur 1
tanaman 5
I have tried this code:
library(tm)
myCorpus <- Corpus(VectorSource(pertanian$TEXT))
myCorpus2 <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus2)
temp<-inspect(tdm)
colnames(temp) <- paste("DOCS", pertanian$DOCS)
Doc.Freq<-data.frame(apply(temp, 1, sum))
# move the terms into a "Term" column and rename the sum column to "DF"
Doc.Freq <- cbind(Term = rownames(Doc.Freq), Doc.Freq)
row.names(Doc.Freq) <- NULL
names(Doc.Freq)[names(Doc.Freq)=="apply.temp..1..sum."] <- "DF"
But the result is the term frequency, not the document frequency: the term 'jagung' is counted as 6, while its document frequency should be 5.
Upvotes: 3
Views: 683
Reputation: 21641
Something like this?
Note: Here I assume that your desired output has an error and that kacang is present in 4 of the 5 docs.
library(tm)
library(dplyr)
v <- Corpus(VectorSource(TEXT))
data.frame(inspect(TermDocumentMatrix(v))) %>%
  add_rownames() %>%
  mutate(DF = rowSums(.[-1] >= 1)) %>%
  select(Term = rowname, DF)
Which gives:
#Source: local data frame [10 x 2]
#
# Term DF
#1 dimakan 3
#2 diserbu 3
#3 jagung 5
#4 kacang 4
#5 kelinci 4
#6 ketimun 4
#7 ladang 3
#8 rusak 4
#9 seumur 1
#10 tanaman 5
Or you could simply do:
transform(rowSums(inspect(TermDocumentMatrix(v)) >= 1))
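To see why the `>= 1` comparison matters, here is a minimal base R sketch that does not need tm at all. It assumes a hand-typed copy of the term-document matrix from the question; the names `terms`, `m`, `tf` and `df` are only illustrative. Summing the raw counts gives the term frequency, while summing the logical matrix counts each document at most once per term.

```r
# Hand-typed copy of the term-document matrix shown in the question
terms <- c("dimakan", "diserbu", "jagung", "kacang", "kelinci",
           "ketimun", "ladang", "rusak", "seumur", "tanaman")
m <- matrix(c(0, 1, 1, 1, 0,
              0, 0, 1, 1, 1,
              2, 1, 1, 1, 1,
              0, 1, 1, 1, 1,
              0, 1, 1, 1, 1,
              0, 1, 1, 1, 1,
              0, 0, 1, 1, 1,
              0, 1, 1, 1, 1,
              1, 0, 0, 0, 0,
              1, 1, 1, 1, 1),
            nrow = length(terms), byrow = TRUE,
            dimnames = list(terms, paste("DOCS", 1:5)))

# Term frequency: total number of occurrences across all documents
tf <- rowSums(m)

# Document frequency: number of documents in which the term occurs at least once
df <- rowSums(m >= 1)

tf[["jagung"]]  # 6 (2 + 1 + 1 + 1 + 1)
df[["jagung"]]  # 5 (present in all five documents)
```

The two results differ only for terms that are repeated within a single document, which here is just 'jagung' in the first document.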
Upvotes: 5
Reputation: 4554
Try this:
dd <- strsplit(as.character(TEXT),' ')
transform(table(unlist(lapply(dd,unique))))
# Var1 Freq
#1 dimakan 3
#2 diserbu 3
#3 jagung 5
#4 kacang 4
#5 kelinci 4
#6 ketimun 4
#7 ladang 3
#8 rusak 4
#9 seumur 1
#10 tanaman 5
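If you want the `Term` and `DF` column names from the question, a small variation on the same idea could look like the sketch below. It assumes the `TEXT` vector from the question, splits on runs of whitespace instead of a single space, and uses the illustrative names `tokens`, `counts` and `Doc.Freq`.

```r
# TEXT as defined in the question
TEXT <- c("tanaman jagung seumur jagung ",
          "tanaman jagung kacang ketimun rusak dimakan kelinci",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak dimakan",
          "ladang diserbu kelinci tanaman jagung kacang ketimun rusak ")

# Trim the ends, then split each document on one or more whitespace characters
tokens <- strsplit(trimws(TEXT), "\\s+")

# Keep each term once per document, then count in how many documents it appears
counts <- table(unlist(lapply(tokens, unique)))

# Put the result into a data frame with the column names used in the question
Doc.Freq <- data.frame(Term = names(counts),
                       DF = as.vector(counts),
                       stringsAsFactors = FALSE)
Doc.Freq
```

Calling `unique()` per document is what turns the count into a document frequency: the repeated 'jagung' in the first document is counted only once.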
Upvotes: 1