R Best way to merge large dataframes of the same size

Question

I have one huge dataframe (400k rows by 8000 cols), another dataframe that's 400k rows by 3 cols, and a third dataframe that's 402k rows by 3 cols. Each df has 2 key columns. Label them:

dfBig - biggest df

dfSmall1 - the df with 402k rows

dfSmall2 - the df with the same number of rows

when I try to compile all the information with the following dplyr solution my RStudio session gets aborted:

dfCompile <- dfBig %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>% 
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

I can't even run dfBig %>% arrange(Key1, Key2) without it blowing up. What I've done instead to make it work is:

# Grab only the keys to cut down on size
dfInfo <- dfBig %>% select(Key1, Key2)

# Merge other dfs to the info DF
dfInfo <- dfInfo %>%
  left_join(payoutMap, by = c('Key1', 'Key2')) %>% 
  left_join(ranks, by = c('Key1', 'Key2')) %>% 
  select(-Key1, -Key2)

# Cbind the rest of the columns back in since I know they'll be in the same order (arrange(.) blows it up)
dfBig <- cbind(dfBig, dfInfo) %>% 
  select(Key1, Key2, newCol1, newCol2, everything())

It works, but it doesn't feel very eloquent, and I don't want to have to mess with it again if it breaks down with an even bigger dataframe. Can sqldf or data.table handle this better? Just want to understand the best path to go down before I start learning for myself.

Joshua Mire · Accepted Answer

I believe using data.frame() would be fastest, but will only work properly when there is complete intersection between the two data frames you are joining. You just have to give it a key = such as in the example below (this might be what you want since you were left merging your larger longer data set onto your shorter data set, thus chopping off at least two unmatched values assuming you only have unique values):

dfCompile <- dfBig %>%
  data.frame(dfSmall1, key = c('Key1', 'Key2')) %>% 
  data.frame(dfSmall2, key = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

Otherwise, using the data.table R package is probably the way to go.

library(data.table)
setDT(dfBig)
setDT(dfSmall1)
setDT(dfSmall2)
dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all = TRUE) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2'), all = TRUE) %>%
  merge(Key1, Key2, newCol1, newCol2, everything())

Note: if you plan to use the method using the data.table package, in order to left join, as was in your example, you want to change all = TRUE to all.x = TRUE in your merge() functions.

I hope this helps!

R Best way to merge large dataframes of the same size

Answers (2)

Related Questions