Reputation: 578
I have a corpus with over 5000 text files. I would like to get individual word counts for each file after pre-processing each one (converting to lower case, removing stopwords, etc.). I haven't had any luck getting the word count for the individual text files. Any help would be appreciated.
library(tm)
revs<-Corpus(DirSource("data/"))
revs<-tm_map(revs,tolower)
revs<-tm_map(revs,removeWords, stopwords("english"))
revs<-tm_map(revs,removePunctuation)
revs<-tm_map(revs,removeNumbers)
revs<-tm_map(revs,stripWhitespace)
dtm<-DocumentTermMatrix(revs)
Upvotes: 2
Views: 19114
Reputation: 51
You can try doing this:
for (m in 1:length(revs)) {
  # note: nchar() counts characters, not words
  print(sum(nchar(as.character(revs[[m]]))))
}
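If you want word counts rather than character counts, a minimal base-R sketch is to split each text on whitespace and count the pieces. The docs vector below is a stand-in for illustration; with a tm corpus you would substitute sapply(revs, as.character):

```r
# Stand-in texts; replace with your own corpus contents
docs <- c("the cat sat", "a dog barked loudly")
# Split each document on whitespace and count the resulting tokens
word_counts <- sapply(strsplit(docs, "\\s+"), length)
word_counts  # 3 4
```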
Upvotes: 0
Reputation: 14902
Your question did not specify that you wanted only R-based solutions, so here is a really simple solution for counting your words in text files: the GNU utility wc, run at a Terminal or command line with -w to specify words, e.g.
KB-iMac:~ kbenoit$ wc -w *.txt
3 mytempfile.txt
3 mytempfileAscii.txt
14 tweet12.txt
17 tweet12b.txt
37 total
The numbers shown are word counts for this set of illustrative text files.
wc is included already on OS X and Linux, and can be installed for Windows from the Rtools set.
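As a quick sanity check, the behaviour can be reproduced with a couple of throwaway files (the file names here are made up for illustration):

```shell
# Create two small sample files and count the words in each
printf 'one two three\n' > sample1.txt
printf 'four five\n' > sample2.txt
# -w prints one word count per file, followed by a total line
wc -w sample1.txt sample2.txt
```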
Upvotes: 0
Reputation: 14902
You can also do this in the quanteda package that I developed with Paul Nulty. It is easy to create your own corpus using the quanteda tools for this purpose, but it also imports tm VCorpus objects directly (as shown below). You can get token counts per document using the summary() method for the corpus object type, or by creating a document-feature matrix using dfm() and then using rowSums() on the resulting document-feature matrix. dfm() by default applies the cleaning steps that you would need to apply separately using the tm package.
data(crude, package="tm")
mycorpus <- corpus(crude)
summary(mycorpus)
## Corpus consisting of 20 documents.
##
## Text Types Tokens Sentences
## reut-00001.xml 56 90 8
## reut-00002.xml 224 439 21
## reut-00004.xml 39 51 4
## reut-00005.xml 49 66 6
## reut-00006.xml 59 88 3
## reut-00007.xml 229 443 25
## reut-00008.xml 232 420 23
## reut-00009.xml 96 134 9
## reut-00010.xml 165 297 22
## reut-00011.xml 179 336 20
## reut-00012.xml 179 360 23
## reut-00013.xml 67 92 3
## reut-00014.xml 68 103 7
## reut-00015.xml 71 97 4
## reut-00016.xml 72 109 4
## reut-00018.xml 90 144 9
## reut-00019.xml 117 194 13
## reut-00021.xml 47 77 12
## reut-00022.xml 142 281 12
## reut-00023.xml 30 43 8
##
## Source: Converted from tm VCorpus 'crude'.
## Created: Sun May 31 18:24:07 2015.
## Notes: .
mydfm <- dfm(mycorpus)
## Creating a dfm from a corpus ...
## ... indexing 20 documents
## ... tokenizing texts, found 3,979 total tokens
## ... cleaning the tokens, 115 removed entirely
## ... summing tokens by document
## ... indexing 1,048 feature types
## ... building sparse matrix
## ... created a 20 x 1048 sparse dfm
## ... complete. Elapsed time: 0.039 seconds.
rowSums(mydfm)
## reut-00001.xml reut-00002.xml reut-00004.xml reut-00005.xml reut-00006.xml reut-00007.xml
## 90 439 51 66 88 443
## reut-00008.xml reut-00009.xml reut-00010.xml reut-00011.xml reut-00012.xml reut-00013.xml
## 420 134 297 336 360 92
## reut-00014.xml reut-00015.xml reut-00016.xml reut-00018.xml reut-00019.xml reut-00021.xml
## 103 97 109 144 194 77
## reut-00022.xml reut-00023.xml
## 281 43
I'm happy to help with any quanteda-related questions.
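A shorter route, if your quanteda version provides it, is ntoken(), which returns per-document token counts directly. A sketch, assuming the mycorpus object created above; note the counts may differ slightly from the dfm() route depending on which cleaning options are applied:

```r
library(quanteda)
# One call gives a named vector of token counts per document,
# without building the document-feature matrix first
ntoken(mycorpus)
```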
Upvotes: 5
Reputation: 42313
As Tyler notes, your question is incomplete without a reproducible example. Here's how to make a reproducible example for this kind of question: use the data that comes built-in with the package.
library("tm") # version 0.6, you seem to be using an older version
data(crude)
revs <- tm_map(crude, content_transformer(tolower))
revs <- tm_map(revs, removeWords, stopwords("english"))
revs <- tm_map(revs, removePunctuation)
revs <- tm_map(revs, removeNumbers)
revs <- tm_map(revs, stripWhitespace)
dtm <- DocumentTermMatrix(revs)
And here's how to get a word count per document: each row of the dtm is one document, so you simply sum across the columns of a row to get that document's word count.
# Word count per document
rowSums(as.matrix(dtm))
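With 5000+ files, calling as.matrix() on the dtm densifies a large sparse matrix and can be memory-hungry. A sketch of an alternative, assuming the slam package (which tm itself depends on) is installed, sums the sparse matrix directly:

```r
library(slam)
# row_sums() operates on the sparse simple_triplet_matrix that
# DocumentTermMatrix() returns, so no dense conversion is needed
row_sums(dtm)  # same values as rowSums(as.matrix(dtm))
```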
Upvotes: 11