MCS

Reputation: 1101

Inner join on large dataset best practices

I am trying to merge two large datasets (around 3.5M rows each) using dplyr::inner_join. I am working on a powerful machine with 40+ cores, but I am not sure I am taking advantage of it, since I am not parallelizing the task in any way. How should I tackle the problem? The join is taking a long time to run.

Best

Upvotes: 2

Views: 1702

Answers (2)

Sinh Nguyen

Reputation: 4487

I don't think a 3.5M-row inner join is going to have performance issues, unless your final dataset blows up toward 3.5M * 3.5M rows after the join because of duplicated values in the key columns of your datasets.
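To see why duplicated keys matter, here is a minimal toy illustration (hypothetical data, not from the question): every row on one side matches every row with the same key on the other side, so 3 x 3 matching rows become 9.

library(dplyr)

# Toy data: the key value 1 appears 3 times on each side
a <- tibble(id = c(1, 1, 1), x = 1:3)
b <- tibble(id = c(1, 1, 1), y = 4:6)

nrow(inner_join(a, b, by = "id"))  # 9 rows, not 3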

Normally in R, functions do not utilize multiple cores. To use them, you have to divide the data into batches that can be processed separately, then combine the partial results and do any further calculation. Here is pseudocode using the dplyr and doParallel packages:

library(dplyr)
library(doParallel)

# Parallel configuration #####
cpuCount <- 10
# Note that doParallel replicates your environment to each worker,
# so if your environment uses 10GB of memory and you use 10 cores,
# parallel processing can require 10GB x 10 = 100GB of RAM
registerDoParallel(cpuCount)

data_1 # 3.5M rows; key column is id_1, value column is value_1
data_2 # 3.5M rows; key columns are id_1 & id_2

# Goal is to calculate some stats/summary of value_1 for each combination of id_1 + id_2
id_1_unique <- unique(data_1$id_1)
batchStep <- 1000
# Starting position of each batch of id_1 values
batch_starts <- seq(1, length(id_1_unique), by = batchStep)

# Do the join for each batch of id_1 values, then summarise.
# foreach returns a list (here, a list of data frames),
# which can be combined with bind_rows.
summaryData <- bind_rows(foreach(start = batch_starts) %dopar% {
    end <- min(start + batchStep - 1, length(id_1_unique))
    batch_id_1_current <- id_1_unique[start:end]
    batch_data_1 <- data_1 %>% filter(id_1 %in% batch_id_1_current)
    joined_data <- inner_join(batch_data_1, data_2, by = "id_1")
    joined_data %>%
        group_by(id_1, id_2) %>%
        # calculation code here
        summarise(calculated_value_1 = sum(value_1)) %>%
        ungroup()
})
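One practical note: after the loop finishes, the implicit cluster registered by registerDoParallel(cpuCount) can be released with doParallel::stopImplicitCluster().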

Upvotes: 2

GenesRus

Reputation: 1057

You should try the data.table package, which is much more efficient than dplyr for large datasets. I've copied over the inner join code from here.

library(data.table)
DT <- data.table(x=rep(c("b","a","c"),each=3), y=c(1,3,6), v=1:9)
X  <- data.table(x=c("c","b"), v=8:7, foo=c(4,2))
DT[X, on="x", nomatch=0] # inner join
                         # SELECT DT INNER JOIN X ON DT$x = X$x
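Applied to the question's data, a sketch would look like this (assuming data_1 and data_2 share the key column id_1, as in the other answer):

library(data.table)

# Convert the existing data frames to data.tables in place
setDT(data_1)
setDT(data_2)

# Inner join on id_1; nomatch=0 drops rows without a match
result <- data_1[data_2, on = "id_1", nomatch = 0]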

Although data.table doesn't use parallelization here, it will be faster than inner_join and is, to the best of my knowledge, the best option.

Upvotes: 0
