Reputation: 747
I have one huge dataframe (400k rows by 8000 cols), another dataframe that's 400k rows by 3 cols, and a third that's 402k rows by 3 cols. Each df has 2 key columns. Label them:
- dfBig: the biggest df (400k rows by 8000 cols)
- dfSmall1: the df with 402k rows
- dfSmall2: the df with 400k rows (same as dfBig)
When I try to compile all the information with the following dplyr solution, my RStudio session gets aborted:
dfCompile <- dfBig %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
I can't even run dfBig %>% arrange(Key1, Key2) without it blowing up. What I've done instead to make it work is:
# Grab only the keys to cut down on size
dfInfo <- dfBig %>% select(Key1, Key2)

# Merge the other dfs onto the keys-only df
dfInfo <- dfInfo %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(-Key1, -Key2)

# cbind the new columns back in, since I know the rows are in the same order (arrange(.) blows it up)
dfBig <- cbind(dfBig, dfInfo) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
It works, but it doesn't feel very elegant, and I don't want to have to mess with it again if it breaks down with an even bigger dataframe. Can sqldf or data.table handle this better? I just want to understand the best path to go down before I start learning for myself.
Upvotes: 0
Views: 1818
Reputation: 1279
It sounds like the data is too large to fit into memory and R is crashing. Here are a few ideas to work around it:
- It sounds like you've tried dplyr::left_join and data.table::merge; one other option would be to try base::merge, although admittedly it's a long shot if the others didn't work.
- Do you need all of the columns in your data? Can you remove some unneeded ones before performing the merge? (See the dplyr::select sketch after this list.)
- Another approach is to add swap space to your system. You can make the swap space as large as you want, but your application may slow to a near-halt if too much swap is required.
- You could try a package like disk.frame, which performs the operation one chunk at a time (see the disk.frame sketch after this list). See https://diskframe.com/.
- Can you find a machine with more memory? You could rent a machine from a cloud provider (AWS, GCP, Azure, or elsewhere).
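A minimal sketch of the column-trimming idea, assuming needed_cols is a hypothetical character vector of the non-key columns you actually need downstream (the names below are placeholders, not from the original post):

library(dplyr)

# Placeholder list of the columns you really need from dfBig
needed_cols <- c('colA', 'colB')

dfCompile <- dfBig %>%
  select(Key1, Key2, all_of(needed_cols)) %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2'))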
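And a rough disk.frame sketch, assuming the big table can be written out as a chunked disk.frame first (in practice you would more likely build it directly from the source files with csv_to_disk.frame() rather than from an in-memory object); treat this as an outline under those assumptions, not tested code:

library(disk.frame)
library(dplyr)
setup_disk.frame()  # set up the parallel workers disk.frame uses

# Store dfBig on disk as chunks
dfBig.df <- as.disk.frame(dfBig, outdir = 'dfBig.df', overwrite = TRUE)

# disk.frame supports dplyr verbs; each chunk is joined against the small,
# in-memory lookup tables, so the 8000-column table never has to sit in RAM all at once
dfCompile <- dfBig.df %>%
  left_join(dfSmall1, by = c('Key1', 'Key2')) %>%
  left_join(dfSmall2, by = c('Key1', 'Key2')) %>%
  collect()  # or write the result back out chunk by chunk instead of collecting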
Upvotes: 0
Reputation: 736
I believe using base merge() would be the fastest, but by default it does an inner join, so it will only work properly when there is complete intersection between the two data frames you are joining. You just have to give it a by = key, as in the example below (this might be what you want anyway, since your left join onto the shorter dfBig already chops off the unmatched rows of dfSmall1, assuming the keys are unique):

dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2')) %>%
  merge(dfSmall2, by = c('Key1', 'Key2')) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
Otherwise, using the data.table R package is probably the way to go.
library(data.table)
library(dplyr)  # for %>% and select()

setDT(dfBig)
setDT(dfSmall1)
setDT(dfSmall2)

dfCompile <- dfBig %>%
  merge(dfSmall1, by = c('Key1', 'Key2'), all = TRUE) %>%
  merge(dfSmall2, by = c('Key1', 'Key2'), all = TRUE) %>%
  select(Key1, Key2, newCol1, newCol2, everything())
Note: if you use the data.table approach and want a left join, as in your example, change all = TRUE to all.x = TRUE in the merge() calls.
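If memory is the real bottleneck, one more data.table option (not shown above; it assumes the extra columns really are named newCol1 and newCol2, one coming from each small table, and that the keys are unique there) is an update join, which writes the new columns into dfBig by reference instead of materialising a second copy of the 8000-column table:

library(data.table)
setDT(dfBig)
setDT(dfSmall1)
setDT(dfSmall2)

# Left-join semantics by reference: rows of dfBig with no match simply get NA
dfBig[dfSmall1, on = c('Key1', 'Key2'), newCol1 := i.newCol1]
dfBig[dfSmall2, on = c('Key1', 'Key2'), newCol2 := i.newCol2]

# Recent data.table versions accept a partial column list here and keep the rest in order
setcolorder(dfBig, c('Key1', 'Key2', 'newCol1', 'newCol2'))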
I hope this helps!
Upvotes: 2