Reputation: 578
I have a corpus with over 5000 text files. I would like to get individual word counts for each file after pre-processing each one (converting to lower case, removing stopwords, etc.). I haven't had any luck getting the word count for the individual text files. Any help would be appreciated.
library(tm)
revs<-Corpus(DirSource("data/"))
revs<-tm_map(revs,tolower)
revs<-tm_map(revs,removeWords, stopwords("english"))
revs<-tm_map(revs,removePunctuation)
revs<-tm_map(revs,removeNumbers)
revs<-tm_map(revs,stripWhitespace)
dtm<-DocumentTermMatrix(revs)
Upvotes: 2
Views: 19114
Reputation: 51
You can try doing this:
for (m in 1:length(revs)) {
  # note: nchar() counts characters, not words
  print(sum(nchar(as.character(revs[[m]]))))
}
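If you want word counts rather than character counts, a minimal base-R sketch is to split each text on whitespace and count the pieces. The docs vector below is a stand-in for illustration; with a tm corpus you would substitute sapply(revs, as.character):

```r
# Stand-in texts; replace with your own corpus contents
docs <- c("the cat sat", "a dog barked loudly")
# Split each document on whitespace and count the resulting tokens
word_counts <- sapply(strsplit(docs, "\\s+"), length)
word_counts  # 3 4
```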
Upvotes: 0
Reputation: 14902
Your question did not specify that you wanted only R-based solutions, so here is a really simple solution for counting your words in text files: the GNU utility wc, run at a Terminal or command line with -w to specify words, e.g.
KB-iMac:~ kbenoit$ wc -w *.txt
3 mytempfile.txt
3 mytempfileAscii.txt
14 tweet12.txt
17 tweet12b.txt
37 total
The numbers shown are word counts for this set of illustrative text files.
wc is included already on OS X and Linux, and can be installed for Windows from the Rtools set.
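As a quick sanity check, the behaviour can be reproduced with a couple of throwaway files (the file names here are made up for illustration):

```shell
# Create two small sample files and count the words in each
printf 'one two three\n' > sample1.txt
printf 'four five\n' > sample2.txt
# -w prints one word count per file, followed by a total line
wc -w sample1.txt sample2.txt
```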
Upvotes: 0
Reputation: 14902
You can also do this in the quanteda package that I developed with Paul Nulty. It is easy to create your own corpus using the quanteda tools for this purpose, but it also imports tm VCorpus objects directly (as shown below). You can get token counts per document using the summary() method for the corpus object type, or by creating a document-feature matrix using dfm() and then using rowSums() on the resulting document-feature matrix. dfm() by default applies the cleaning steps that you would need to apply separately using the tm package.
data(crude, package="tm")
mycorpus <- corpus(crude)
summary(mycorpus)
## Corpus consisting of 20 documents.
##
## Text Types Tokens Sentences
## reut-00001.xml 56 90 8
## reut-00002.xml 224 439 21
## reut-00004.xml 39 51 4
## reut-00005.xml 49 66 6
## reut-00006.xml 59 88 3
## reut-00007.xml 229 443 25
## reut-00008.xml 232 420 23
## reut-00009.xml 96 134 9
## reut-00010.xml 165 297 22
## reut-00011.xml 179 336 20
## reut-00012.xml 179 360 23
## reut-00013.xml 67 92 3
## reut-00014.xml 68 103 7
## reut-00015.xml 71 97 4
## reut-00016.xml 72 109 4
## reut-00018.xml 90 144 9
## reut-00019.xml 117 194 13
## reut-00021.xml 47 77 12
## reut-00022.xml 142 281 12
## reut-00023.xml 30 43 8
##
## Source: Converted from tm VCorpus 'crude'.
## Created: Sun May 31 18:24:07 2015.
## Notes: .
mydfm <- dfm(mycorpus)
## Creating a dfm from a corpus ...
## ... indexing 20 documents
## ... tokenizing texts, found 3,979 total tokens
## ... cleaning the tokens, 115 removed entirely
## ... summing tokens by document
## ... indexing 1,048 feature types
## ... building sparse matrix
## ... created a 20 x 1048 sparse dfm
## ... complete. Elapsed time: 0.039 seconds.
rowSums(mydfm)
## reut-00001.xml reut-00002.xml reut-00004.xml reut-00005.xml reut-00006.xml reut-00007.xml
## 90 439 51 66 88 443
## reut-00008.xml reut-00009.xml reut-00010.xml reut-00011.xml reut-00012.xml reut-00013.xml
## 420 134 297 336 360 92
## reut-00014.xml reut-00015.xml reut-00016.xml reut-00018.xml reut-00019.xml reut-00021.xml
## 103 97 109 144 194 77
## reut-00022.xml reut-00023.xml
## 281 43
I'm happy to help with any quanteda-related questions.
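A shorter route, if your quanteda version provides it, is ntoken(), which returns per-document token counts directly. A sketch, assuming the mycorpus object created above; note the counts may differ slightly from the dfm() route depending on which cleaning options are applied:

```r
library(quanteda)
# One call gives a named vector of token counts per document,
# without building the document-feature matrix first
ntoken(mycorpus)
```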
Upvotes: 5
Reputation: 42313
As Tyler notes, your question is incomplete without a reproducible example. Here's how to make a reproducible example for this kind of question: use the data that comes built-in with the package.
library("tm") # version 0.6, you seem to be using an older version
data(crude)
revs <- tm_map(crude, content_transformer(tolower))
revs <- tm_map(revs, removeWords, stopwords("english"))
revs <- tm_map(revs, removePunctuation)
revs <- tm_map(revs, removeNumbers)
revs <- tm_map(revs, stripWhitespace)
dtm <- DocumentTermMatrix(revs)
And here's how to get a word count per document: each row of the dtm is one document, so you simply sum across the columns of a row to get that document's word count.
# Word count per document
rowSums(as.matrix(dtm))
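With 5000+ files, calling as.matrix() on the dtm densifies a large sparse matrix and can be memory-hungry. A sketch of an alternative, assuming the slam package (which tm itself depends on) is installed, sums the sparse matrix directly:

```r
library(slam)
# row_sums() operates on the sparse simple_triplet_matrix that
# DocumentTermMatrix() returns, so no dense conversion is needed
row_sums(dtm)  # same values as rowSums(as.matrix(dtm))
```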
Upvotes: 11