Reputation: 3656
I have a list of URLs for which I have fetched the web content and loaded it into tm corpora:
library(tm)
library(XML)
link <- c(
"http://www.r-statistics.com/tag/hadley-wickham/",
"http://had.co.nz/",
"http://vita.had.co.nz/articles.html",
"http://blog.revolutionanalytics.com/2010/09/the-r-files-hadley-wickham.html",
"http://www.analyticstory.com/hadley-wickham/"
)
create.corpus <- function(url.name){
  doc=htmlParse(url.name)
  parag=xpathSApply(doc,'//p',xmlValue)
  if (length(parag)==0){
    parag="empty"
  }
  cc=Corpus(VectorSource(parag))
  meta(cc,"link")=url.name
  return(cc)
}
link=catch$url
cc <- lapply(link, create.corpus)
This gives me a "large list" of corpora, one for each URL. Combining them one by one works:
x=cc[[1]]
y=cc[[2]]
z=c(x,y,recursive=T) # preserved metadata
x;y;z
# A corpus with 8 text documents
# A corpus with 2 text documents
# A corpus with 10 text documents
But this becomes infeasible for a list with a few thousand corpora. So how can a list of corpora be merged into one corpus while preserving the metadata?
Upvotes: 6
Views: 3554
Reputation: 15857
Your code does not run because catch is not defined, so I don't know exactly what it is supposed to do.
But tm corpora can now simply be combined with c() into one big corpus: https://www.rdocumentation.org/packages/tm/versions/0.7-1/topics/tm_combine
So maybe c(unlist(cc)) would work. I have no way to test that, though, because your code doesn't run.
Upvotes: 0
Reputation: 121608
I don't think that tm offers any built-in function to join/merge many corpora. But after all, a corpus is a list of documents, so the question is how to transform a list of lists into a list. I would create a new corpus from all the documents, then assign the metadata manually:
y = Corpus(VectorSource(unlist(cc)))
meta(y,'link') = do.call(rbind,lapply(cc,meta))$link
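The flatten-then-reattach idea can be sketched in base R without tm. Here plain character vectors stand in for the per-URL corpora, and the names pages, texts, links and merged are made up for illustration. The point is only that each link has to be repeated once per document it produced, so that the metadata stays aligned after flattening.

```r
# Hypothetical stand-in for the list of corpora: one character vector
# of paragraphs per URL, named by its source link.
pages <- list(
  "http://had.co.nz/" = c("para 1", "para 2", "para 3"),
  "http://vita.had.co.nz/articles.html" = c("para A", "para B")
)

# Flatten the list of vectors into a single vector of documents.
texts <- unlist(pages, use.names = FALSE)

# Repeat each link once per paragraph so link and text line up.
links <- rep(names(pages), times = lengths(pages))

# One row per document, with its originating link attached.
merged <- data.frame(link = links, text = texts, stringsAsFactors = FALSE)
print(merged)
```

With real corpora the same alignment question applies: the vector assigned as metadata needs one entry per document, not one per URL.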
Upvotes: 2
Reputation: 81733
You can use do.call to call c:
do.call(function(...) c(..., recursive = TRUE), cc)
# A corpus with 155 text documents
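For readers unfamiliar with the idiom: do.call passes the elements of a list as separate arguments to a function, so do.call(c, list(a, b)) makes the same call as c(a, b). A minimal base R sketch, with plain integer vectors standing in for corpora (an assumption for illustration; parts is a made-up name), next to the pairwise Reduce equivalent:

```r
# Hypothetical stand-in for the list of corpora: three integer vectors.
parts <- list(1:3, 4:5, 6:9)

# One call to c() with every list element as its own argument.
all_at_once <- do.call(c, parts)

# The same result built pairwise: c(c(parts[[1]], parts[[2]]), parts[[3]]).
pairwise <- Reduce(c, parts)

# Both approaches yield the same combined vector.
stopifnot(identical(all_at_once, pairwise))
print(length(all_at_once))
```

For a list of a few thousand elements, the single do.call is usually preferable to Reduce, which rebuilds the growing result at every step.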
Upvotes: 5