Reputation: 11
I'm trying to merge two data.tables, both of which are around 60-80 million rows in length. I know that data.table is already built so that it is very adept at merging, but I'm wondering for data of this size is data.table still more efficient than potentially parallelizing it, especially since I have access to a computing cluster.
This is what I'm currently doing.
setorder(fcc_temp, BlockCode)
setorder(block_data_long, block_fips)
fcc_temp[block_data_long, c("pop", "tract") := list(pop, tract),
on = c(BlockCode="block_fips", year="year")]
Upvotes: 1
Views: 772
Reputation: 16697
Your example doesn't show many details, such as the data types.
data.table's join is currently single-threaded. Some small parts of it use multiple cores, but AFAIR that is only the ordering of the join columns; computing the matches runs on a single thread.
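If you want to check or limit how many threads that multi-threaded ordering step may use (for example on a shared cluster node), data.table exposes getDTthreads() and setDTthreads(); a minimal sketch:

library(data.table)
getDTthreads()   # how many threads data.table is currently allowed to use
setDTthreads(4)  # cap thread usage, e.g. when sharing a node; the match-finding step itself is single-threaded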
Keep in mind that parallelizing a join is non-trivial and does not scale as well as many other operations, so the potential gains are much smaller than, for example, for grouping.
Anyway, computing the matches is still very fast. We run a benchmark that compares joins; one of the questions (question 5) is a "big to big join", which seems to correspond to your scenario: https://h2oai.github.io/db-benchmark/ Below is the 100M data size for the join task. Q5 is a join of a 100M-row LHS to a 100M-row RHS:
You can see that data.table is pretty much at the top. Note that the benchmark joins on a single integer column, so there is likely to be some difference from your scenario, where you join on two columns.
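If you want a rough feel for that kind of timing on your own hardware, a scaled-down sketch of a big-to-big update join on synthetic data is below (1e7 rows instead of 1e8 so it runs comfortably in memory; the table and column names are made up for illustration):

library(data.table)
n <- 1e7
x <- data.table(id = sample(n), v1 = runif(n))   # LHS with an integer key column
y <- data.table(id = sample(n), v2 = runif(n))   # RHS with the same key domain
system.time(x[y, v2 := i.v2, on = "id"])         # update join on a single integer column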
The benchmark does not take into account the possibility of pre-sorting the data. Try using setkey (instead of setorder) on your tables to sort them by the join columns.
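For example, with the column names from your question (a sketch, assuming block_data_long is your second table):

# setkey() physically sorts each table by its join columns and marks them as keyed,
# so the join can skip the ordering step
setkey(fcc_temp, BlockCode, year)
setkey(block_data_long, block_fips, year)

fcc_temp[block_data_long, c("pop", "tract") := list(i.pop, i.tract),
         on = c(BlockCode = "block_fips", year = "year")]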
To be fair, I believe it might not be easy to beat this kind of setup.
In a future version, computing the matches of a join will be parallelized as well; a draft of that is already in the repository.
Upvotes: 2