Reputation: 81
I am trying to run an LDA using the topicmodels package in R. The example given in the manual uses Associated Press data and works nicely. However, when I try it on my own data I get topics whose terms are the document names. I have traced the problem to my term-document matrix being the transpose of what it should be (rows and columns are swapped).
The example TDM:
str(AssociatedPress)
List of 6
$ i : int [1:302031] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:302031] 116 153 218 272 299 302 447 455 548 597 ...
$ v : int [1:302031] 1 2 1 1 1 1 2 1 1 1 ...
$ nrow : int 2246
$ ncol : int 10473
$ dimnames:List of 2
..$ Docs : NULL
..$ Terms: chr [1:10473] "aaron" "abandon" "abandoned" "abandoning" ...
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
Whereas my TDM has Terms as rows and Docs as columns:
List of 6
$ i : int [1:10489] 1 3 4 13 20 24 25 26 27 28 ...
$ j : int [1:10489] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:10489] 1 1 1 1 2 1 67 1 44 3 ...
$ nrow : int 5903
$ ncol : int 9
$ dimnames:List of 2
..$ Terms: chr [1:5903] "\u2439aa" "aars" "\u2439ab" "\u242dab" ...
..$ Docs : chr [1:9] "art111130.txt" "art111131.txt" "art111132.txt" "art111133.txt" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
This causes
LDA(art_tdm, 3)
to build topics from the document names rather than from the terms within the documents. Is this due to a change in the codebase of the tm package? I can't see what in my code would cause this transposition:
art_cor<-Corpus(DirSource(directory = "tmptxts"))
art_tdm<-TermDocumentMatrix(art_cor)
Any help would be appreciated.
Upvotes: 4
Views: 1533
Reputation: 263411
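Your object is of class "TermDocumentMatrix", whereas the AssociatedPress example is of class "DocumentTermMatrix". LDA() expects documents in rows and terms in columns, so it treats the rows of your matrix (the terms) as documents and the columns (the file names) as terms.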
You probably just need to do this:
art_tdm<-DocumentTermMatrix(art_cor)
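To illustrate the difference, here is a minimal sketch that uses a small in-memory corpus in place of the files in "tmptxts" (the three sentences and the names art_dtm and art_dtm2 are made up for the example). It shows the two orientations side by side, and that an existing term-document matrix can also be transposed with t() rather than rebuilt:

```r
library(tm)

# Hypothetical three-document corpus standing in for DirSource("tmptxts")
docs <- c("the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats make good pets")
art_cor <- VCorpus(VectorSource(docs))

# Terms in rows, documents in columns (the orientation from the question)
art_tdm <- TermDocumentMatrix(art_cor)

# Option 1: build the document-term matrix directly from the corpus
art_dtm <- DocumentTermMatrix(art_cor)

# Option 2: transpose the term-document matrix that already exists
art_dtm2 <- t(art_tdm)

class(art_dtm2)  # includes "DocumentTermMatrix"
dim(art_tdm)     # terms x documents
dim(art_dtm2)    # documents x terms

# Either result can then be passed to topicmodels, e.g.:
# library(topicmodels)
# fit <- LDA(art_dtm, k = 3)
# terms(fit, 5)
```

Checking dim() or class() before calling LDA() is a quick way to confirm that documents are in the rows.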
Upvotes: 3