CoolGuyHasChillDay

Reputation: 747

R: Best way to merge large dataframes of the same size

I have one huge dataframe (400k rows by 8000 cols), another dataframe that's 400k rows by 3 cols, and a third dataframe that's 402k rows by 3 cols. Each df has 2 key columns. Label them:

dfBig - biggest df

dfSmall1 - the df with 402k rows

dfSmall2 - the df with the same number of rows as dfBig (400k)

When I try to compile all the information with the following dplyr solution, my RStudio session gets aborted:

dfCompile <- dfBig %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>% 
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

I can't even run dfBig %>% arrange(Key1, Key2) without it blowing up. What I've done instead to make it work is:

# Grab only the keys to cut down on size
dfInfo <- dfBig %>% select(Key1, Key2)

# Merge other dfs to the info DF
dfInfo <- dfInfo %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>% 
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>% 
  select(-Key1, -Key2)

# Cbind the rest of the columns back in since I know they'll be in the same order (arrange(.) blows it up)
dfBig <- cbind(dfBig, dfInfo) %>% 
  select(Key1, Key2, newCol1, newCol2, everything())

It works, but it doesn't feel very elegant, and I don't want to have to mess with it again if it breaks down with an even bigger dataframe. Would sqldf or data.table handle this better? I just want to understand the best path to go down before I start learning it for myself.

Upvotes: 0

Views: 1818

Answers (2)

dpritch

Reputation: 1279

It sounds like the size of the data is too large to fit into memory and R is crashing. Here are a few ideas to work around it.

  1. It sounds like you've tried dplyr::left_join and data.table::merge; one other option would be to try base::merge, although admittedly it is a long shot if the others didn't work.

  2. Do you need all of the columns in your data -- can you remove some unneeded ones before performing the merge?

  3. Another approach you could take is to add swap space to your system. You can make your swap space as large as you want, but your application may crawl to a near-halt if too much swap is required.

  4. You could try a package like disk.frame, which performs the operation one chunk at a time (see https://diskframe.com/ and the sketch after this list).

  5. Can you find a machine with more memory? You could rent a machine on a cloud provider such as AWS, GCP, Azure, or elsewhere.
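As a rough sketch of option 4 (untested here, and assuming dfSmall1 and dfSmall2 still fit comfortably in memory while only dfBig is moved to disk; the worker count is arbitrary), the disk.frame workflow might look something like this -- check the package docs for the exact calls:

# Sketch of option 4 using disk.frame (assumptions noted above)
library(disk.frame)
library(dplyr)

# Start a few background workers for chunk-wise processing
setup_disk.frame(workers = 4)

# Store the big table on disk in chunks; the small tables stay in memory
dfBig.dsk <- as.disk.frame(dfBig)

# The joins run one chunk at a time; collect() brings the result back
# into an ordinary in-memory data frame
dfCompile <- dfBig.dsk %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  collect() %>%
  select(Key1, Key2, newCol1, newCol2, everything())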

Upvotes: 0

Joshua Mire

Reputation: 736

I believe using base merge() would be fastest, but it will only work properly (without dropping rows) when there is complete intersection between the two data frames you are joining, since merge() defaults to an inner join. You just have to give it the key columns via by =, as in the example below (this might be what you want anyway, since your left merge of the longer data set onto the shorter one already chops off at least 2k unmatched rows, assuming the keys are unique):

dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2')) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

Otherwise, using the data.table R package is probably the way to go.

library(data.table)
library(dplyr)  # for %>% and select()
setDT(dfBig)
setDT(dfSmall1)
setDT(dfSmall2)
dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all = TRUE) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2'), all = TRUE) %>%
  select(Key1, Key2, newCol1, newCol2, everything())

Note: if you go with the data.table approach and want a left join, as in your original example, change all = TRUE to all.x = TRUE in the merge() calls.
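For reference, that left-join variant of the same pipeline would look like this:

# Left-join variant: keeps every row of dfBig, as in the original dplyr code
dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all.x = TRUE) %>% 
  merge(dfSmall2, by = c('Key1', 'Key2'), all.x = TRUE) %>%
  select(Key1, Key2, newCol1, newCol2, everything())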

I hope this helps!

Upvotes: 2
