Sawol

Reputation: 33

How to filter documents in a tm corpus in R based on metadata?

I am using the R tm package and I am trying to select certain documents by their index and their metadata:

orbit_corpus <- Corpus(tm_corpus, readerControl = list(reader = myReader))

meta(orbit_corpus[[1]])

author  : a8
origin  : Department 
heading : WhiB
id      : 1
year    : 2013

I would like to find all documents among the first hundred documents of my corpus that were published in 2013. The following works to check whether the metadata 'year' for document 1 is 2013:

meta(orbit_corpus[[1]], "year") == 2013
[1] TRUE

I need something that returns the indexes of all documents among the first 100 that meet this criterion. I imagine something similar to the following, but it does not work (and would probably not return a list of the matching documents anyway):

meta(orbit_corpus[[1:100]], "year") == 2013
Error in x$content[[i]] : recursive indexing failed at level 4

Many thanks for the help!

Upvotes: 3

Views: 2674

Answers (1)

Steven Beaupré

Reputation: 21621

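You could use tm_filter on the first 100 documents of your corpus (orbit_corpus[1:100]). Note the single brackets: [ returns a sub-corpus, whereas [[ extracts a single document, which is why [[1:100]] fails: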

tm_filter(orbit_corpus[1:100], FUN = function(x) meta(x)[["year"]] == "2013")

From the documentation:

tm_filter returns a corpus containing documents where FUN matches
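
Since the question asks for the indexes rather than the documents themselves, tm_index (documented on the same help page as tm_filter) may be closer to what is needed. The following is only a minimal sketch: toy_corpus and its 'year' values are made-up stand-ins for orbit_corpus, and it assumes 'year' is stored as document-level metadata.

```r
library(tm)

# Hypothetical toy corpus standing in for orbit_corpus
docs <- c("first text", "second text", "third text")
toy_corpus <- VCorpus(VectorSource(docs))

# Attach a made-up 'year' tag to each document
years <- c(2013, 2012, 2013)
for (i in seq_along(toy_corpus)) {
  meta(toy_corpus[[i]], "year") <- years[i]
}

# Restrict to the first 100 documents (or fewer, if the corpus is smaller)
n <- min(100, length(toy_corpus))
first_docs <- toy_corpus[seq_len(n)]

# Predicate applied to each document
is_2013 <- function(x) meta(x, "year") == 2013

# tm_index gives one logical value per document; which() turns that into indexes
matches <- tm_index(first_docs, FUN = is_2013)
idx <- which(matches)

# tm_filter gives the matching documents themselves as a corpus
subset_corpus <- tm_filter(first_docs, FUN = is_2013)
```

Because the subset starts at document 1, the positions in idx are also valid positions in the full corpus.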

Upvotes: 4
