Reputation: 708
I'm loading text documents from the database, then creating a corpus from them, and finally setting a prefixed ID on each document (I need the prefix because I have documents of several types).
rs <- dbSendQuery(con,"SELECT id::TEXT, content FROM entry")
entry.d = data.table(fetch(rs,n=-1))
entry.vs = VectorSource(entry.d$content)
entry.vc = VCorpus(entry.vs, readerControl = list(language = "pl"))
meta(entry.vc, tag = 'id', type = 'local') = paste0("e:",entry.d$id)
This is very slow: it takes 6 minutes, whereas
tm_map(entry.vc, tm_reduce, tmFuns = funs, mc.cores=1)
where funs is a list of 6 functions, takes only 2 minutes more.
Is there any way to do it faster?
Upvotes: 1
Views: 99
Reputation: 708
I've changed my code to set IDs during initialization of the VCorpus.
rs <- dbSendQuery(con,"SELECT ('e:'||id) AS id, content, 'pl'::TEXT AS language FROM entry")
entry.d = data.table(fetch(rs,n=-1))
entry.dfs = DataframeSource(entry.d)
reader <- readTabular(mapping=list(content="content", id="id", language='language'))
entry.vc = VCorpus(entry.dfs, readerControl = list(reader = reader))
Now it takes only 2.5 minutes to generate the VCorpus with custom IDs.
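For anyone on a newer tm release: as far as I know, from version 0.7 on readTabular is no longer available, and DataframeSource instead expects the first two columns to be named doc_id and text. A minimal sketch under that assumption, with a small made-up data frame standing in for the query result (the column aliases and sample texts are illustrative, not from my real data):

```r
library(tm)

# Hypothetical stand-in for the query result; against the real database this
# would come from something like:
#   SELECT ('e:'||id) AS doc_id, content AS text FROM entry
entry.d <- data.frame(
  doc_id = paste0("e:", 1:3),
  text   = c("pierwszy dokument", "drugi dokument", "trzeci dokument"),
  stringsAsFactors = FALSE
)

# The IDs are taken from the doc_id column while the corpus is built,
# so no separate meta() assignment is needed afterwards.
entry.vc <- VCorpus(DataframeSource(entry.d),
                    readerControl = list(language = "pl"))

# Inspect the ID of the first document
meta(entry.vc[[1]], "id")
```

The idea is the same as above: the prefixed ID is built in SQL and handed to the corpus constructor, rather than being assigned document by document afterwards.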
Upvotes: 2