CptNemo

Reputation: 6755

How to add metadata to tm Corpus object with tm_map

I have been reading different questions and answers (especially here and here) without managing to apply any of them to my situation.

I have an 11,390-row matrix with the columns id, author, and text, such as:

library(tm)

m <- cbind(c("01","02","03","04","05","06"),
           c("Author1","Author2","Author2","Author3","Author3","Author4"),
           c("Text1","Text2","Text3","Text4","Text5","Text6"))

I want to create a tm corpus out of it. I can quickly create my corpus with

tm_corpus <- Corpus(VectorSource(m[,3]))

which finishes for my 11,390-row matrix in

   user  system elapsed 
  2.383   0.175   2.557 

But then when I try to add metadata to the corpus with

meta(tm_corpus, type="local", tag="Author") <- m[,2]

the execution ran for more than 15 minutes without finishing (I then stopped it).

According to the discussion here, processing the corpus with tm_map is likely to reduce the time significantly; something like

tm_corpus <- tm_map(tm_corpus, addMeta, m[,2])

Still, I am not sure how to do this. It is probably going to be something like

addMeta <- function(text, vector) {
  meta(text, tag="Author") = vector[??]
  text
}

For one thing, how do I pass tm_map a vector of values to be assigned to each text of the corpus? Should I call the function from within a loop? Should I wrap the tm_map call in vapply?

Upvotes: 2

Views: 5768

Answers (3)

Ilona

Reputation: 476

Since readTabular has been deprecated in the tm package, the solution might now look like this:

## DataframeSource requires the first two columns to be "doc_id" (unique per
## document) and "text"; further columns such as "author" become metadata
docs <- data.frame(doc_id = c("01","02","03","04","05","06"),
                   text   = c("Text1","Text2","Text3","Text4","Text5","Text6"),
                   author = c("Author1","Author2","Author2","Author3","Author3","Author4"),
                   stringsAsFactors = FALSE)
tm_corpus <- Corpus(DataframeSource(docs))
inspect(tm_corpus)
meta(tm_corpus)
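
Once the authors travel with the corpus as document-level metadata, they can be used to select documents. The following is only a minimal sketch, assuming a recent tm version (0.7 or later) in which DataframeSource keeps every column after "doc_id" and "text" as metadata; the names docs_df, corpus and by_author2 are made up for illustration.

```r
library(tm)

# Hypothetical sample data: "doc_id" and "text" come first, "author" is extra
docs_df <- data.frame(doc_id = c("01", "02", "03", "04", "05", "06"),
                      text   = c("Text1", "Text2", "Text3", "Text4", "Text5", "Text6"),
                      author = c("Author1", "Author2", "Author2",
                                 "Author3", "Author3", "Author4"),
                      stringsAsFactors = FALSE)

corpus <- Corpus(DataframeSource(docs_df))

# The extra column is available as a metadata column of the corpus
authors <- meta(corpus)$author

# Keep only the documents written by one author
by_author2 <- corpus[authors == "Author2"]
inspect(by_author2)
```

Because the metadata is a plain data frame column, any logical condition on it can be used for subsetting in the same way.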

Upvotes: 0

Dennis Proksch

Reputation: 260

Have you already tried the excellent readTabular?

## your sample data
matrix <- cbind(c("01","02","03","04","05","06"),
       c("Author1","Author2","Author2","Author3","Author3","Author4"),
       c("Text1","Text2","Text3","Text4","Text5","Text6"))

## simple transformations
matrix <- as.data.frame(matrix)
names(matrix) <- c("id", "author", "content")

Your former matrix, now a data.frame, can easily be read in as a corpus using readTabular. readTabular asks you to define a reader, which itself takes a mapping. In the mapping, "content" points to the text data, and the other names point to metadata.

## define myReader, which will be used in creation of Corpus
myReader <- readTabular(mapping=list(id="id", author="author", content="content"))

Creating the corpus is now the same as always, apart from small changes:

## create the corpus
tm_corpus <- DataframeSource(matrix)
tm_corpus <- Corpus(tm_corpus,
    readerControl = list(reader=myReader))

Now have a look at the content and metadata of the items:

lapply(tm_corpus, as.character)
lapply(tm_corpus, meta)
## output just as expected.

This should be fast, as it is part of the package, and it is extremely adaptable. In my own project I am using this on a data.table with some 20 variables, and it works like a charm.

However, I cannot provide a benchmark against the answer you have already accepted. I simply guess that it is faster and more efficient.
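
For anyone who wants to measure this themselves, a comparison could be set up roughly as follows. This is only a sketch under assumptions: it uses synthetic data of the size mentioned in the question, it uses DataframeSource with an extra column in place of readTabular (which newer tm versions no longer provide), and it assumes tm_map processes documents sequentially and in order. All object names are hypothetical, and timings will differ by machine and tm version.

```r
library(tm)

# Synthetic stand-in for the 11,390-row data (made-up values)
n <- 11390
docs_df <- data.frame(doc_id = sprintf("%05d", seq_len(n)),
                      text   = paste0("Text", seq_len(n)),
                      author = paste0("Author", seq_len(n)),
                      stringsAsFactors = FALSE)

# Approach 1: metadata supplied while the corpus is built
t_source <- system.time(
  corpus_a <- Corpus(DataframeSource(docs_df))
)

# Approach 2: per-document metadata set afterwards with tm_map and a counter
corpus_b <- VCorpus(VectorSource(docs_df$text))
i <- 0
t_map <- system.time(
  corpus_b <- tm_map(corpus_b, function(x) {
    i <<- i + 1                           # position of the current document
    meta(x, "Author") <- docs_df$author[i]
    x
  })
)

# Show user, system and elapsed time side by side
print(rbind(source = t_source, tm_map = t_map)[, 1:3])
```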

Upvotes: 7

agstudy

Reputation: 121568

Yes, tm_map is faster and it is the way to go. You should use it here with a global counter.

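## example author vector matching the corpus length (unused below; the authors come from m[,2])
auths <- paste0('Author',seq(11390))
## global counter: index of the document tm_map is currently processing
i <- 0
tm_corpus <- tm_map(tm_corpus, function(x) {
   i <<- i + 1                    # advance the counter in the enclosing environment
   meta(x, "Author") <- m[i,2]    # attach the i-th author to the i-th document
   x
})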
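
The counter only gives the right result if tm_map visits the documents one at a time and in order. If that is a concern, the same assignment can be written with explicit indices. A minimal sketch, assuming a VCorpus (so that each document carries its own metadata) and the small sample matrix from the question; this is an alternative to the counter, not the method above:

```r
library(tm)

m <- cbind(c("01", "02", "03", "04", "05", "06"),
           c("Author1", "Author2", "Author2", "Author3", "Author3", "Author4"),
           c("Text1", "Text2", "Text3", "Text4", "Text5", "Text6"))

# VCorpus keeps every document as a separate object with local metadata
tm_corpus <- VCorpus(VectorSource(m[, 3]))

# Option 1: set the local metadata of document k from row k of the matrix
for (k in seq_len(nrow(m))) {
  meta(tm_corpus[[k]], "Author") <- m[k, 2]
}

# Option 2: store the whole vector as indexed metadata (one row per document)
meta(tm_corpus, "Author") <- m[, 2]

meta(tm_corpus[[3]], "Author")   # local metadata of the third document
meta(tm_corpus)                  # indexed metadata as a data frame
```

Option 2 keeps the authors in a data frame next to the documents rather than inside each document, which is usually enough for filtering or grouping by author.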

Upvotes: 3
