millie0725

Reputation: 393

R inspect() function, from tm package, only returns 10 outputs when using dictionary terms

I have 70 PDFs of scientific papers that I'm trying to narrow down by searching for specific terms, using the dictionary option of DocumentTermMatrix() together with inspect(), both from the tm package. My PDFs are stored in a VCorpus object. Here's an example of what my code looks like, using the crude dataset and common terms that would (probably) show up in every document in crude:

library(tm)
data("crude")  # example corpus shipped with tm
output.matrix <- inspect(DocumentTermMatrix(crude,
                                      list(dictionary = c("i","and",
                                                          "all","of",
                                                          "the","if",
                                                          "i'm","looking",
                                                          "for","but","because","has",
                                                          "it","was"))))
output <- data.frame(output.matrix)

This search only ever returns 10 papers in output.matrix. The output is:

Docs  all and because but for has i i'm the was
  144   0   9       0   5   5   2 0   0  17   1
  236   0   7       4   2   4   5 0   0  15   7
  237   1  11       1   3   3   2 0   0  30   2
  246   0   9       0   0   6   1 0   0  18   2
  248   1   6       1   1   2   0 0   0  27   4
  273   0   5       2   2   4   1 0   0  21   1
  368   0   1       0   1   0   0 0   0  11   2
  489   0   5       0   0   4   0 0   0   8   0
  502   0   6       0   1   5   0 0   0  13   0
  704   0   5       1   0   3   2 0   0  21   0

For my actual dataset of 70 papers, I know there should be more than 10 matches: even as I add more PDFs to my VCorpus that I know contain at least one of my search terms, I still only get 10 in the output. I want the output to be a table like the one shown, listing every paper in the VCorpus that contains a term, not just what I assume is the first 10.

Using R version 4.0.2, macOS High Sierra 10.13.6

Upvotes: 2

Views: 3177

Answers (1)

phiver

Reputation: 23608

You are misinterpreting what inspect() does. For a document-term matrix it prints only a sample of at most 10 rows and 10 columns, not the whole matrix. Use inspect() only to check that your corpus or document-term matrix looks as you expect, never to transform data into a data.frame. If you want the contents of the document-term matrix in a data.frame, the following code does this, using your example and removing every row and column that is zero for all documents or terms.

# do not use inspect as this will give a wrong result!
output.matrix <- DocumentTermMatrix(crude,
                                    list(dictionary = c("i","and",
                                                        "all","of",
                                                        "the","if",
                                                        "i'm","looking",
                                                        "for","but","because","has",
                                                        "it","was")))


# remove rows and columns that are all 0, staying in sparse format for speed
out <- output.matrix[slam::row_sums(output.matrix) > 0,
                     slam::col_sums(output.matrix) > 0]


# transform to data.frame
out_df <- data.frame(docs = row.names(out), as.matrix(out), row.names = NULL)

out_df
   docs all and because but for. has the was
1   127   0   1       0   0    2   0   5   1
2   144   0   9       0   5    5   2  17   1
3   191   0   0       0   0    2   0   4   0
4   194   1   1       0   0    2   0   4   1
5   211   0   2       0   0    2   0   8   0
6   236   0   7       4   2    4   5  15   7
7   237   1  11       1   3    3   2  30   2
8   242   0   3       0   1    1   1   6   1
9   246   0   9       0   0    6   1  18   2
10  248   1   6       1   1    2   0  27   4
11  273   0   5       2   2    4   1  21   1
12  349   0   2       0   0    0   0   5   0
13  352   0   3       0   0    0   0   7   1
14  353   0   1       0   0    2   1   4   3
15  368   0   1       0   1    0   0  11   2
16  489   0   5       0   0    4   0   8   0
17  502   0   6       0   1    5   0  13   0
18  543   0   0       0   0    3   0   5   1
19  704   0   5       1   0    3   2  21   0
20  708   0   0       0   0    0   0   0   1
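
As a variation on the code above, here is a minimal sketch that lists every matching document and keeps the original term names. The names terms, dtm, hits, matching_docs and hits_df are made up for illustration, and the wordLengths setting is an optional assumption about what you want: by default tm discards terms shorter than 3 characters, which is why "of", "if" and "it" never get a count in the output above.

```r
library(tm)
data("crude")  # example corpus shipped with tm

# Hypothetical dictionary; replace with your own search terms.
terms <- c("and", "all", "of", "the", "for", "but", "because", "has", "it", "was")

# wordLengths = c(1, Inf) is optional: it keeps 1- and 2-character terms,
# which tm drops by default (the default is c(3, Inf)).
dtm <- DocumentTermMatrix(crude,
                          control = list(dictionary = terms,
                                         wordLengths = c(1, Inf)))

# Keep only documents that contain at least one dictionary term,
# still in sparse format.
hits <- dtm[slam::row_sums(dtm) > 0, ]

# IDs of all matching documents, with no limit of 10.
matching_docs <- Docs(hits)

# check.names = FALSE keeps column names such as "for" unchanged;
# without it data.frame() renames that column to "for.".
hits_df <- data.frame(docs = matching_docs, as.matrix(hits),
                      row.names = NULL, check.names = FALSE)
```

With your own VCorpus in place of crude, nrow(hits_df) gives the number of papers that contain at least one term, and matching_docs can be used to subset the corpus.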

Upvotes: 3
