Reputation: 601
So my actual dataset is 16 million rows and confidential, but I can illustrate what's happening fairly easily. I don't understand this behaviour at all; it flies in the face of everything I've read, or at least I think it does.
So here's a data frame with strings and dates (the real one has more columns and more rows):
library(tidyverse)
test = data.frame("a" = letters,
                  "b" = seq.Date(as.Date("2018-01-01"),
                                 as.Date("2018-01-26"), "days")
)
I want to produce a third column, pasting together the first two. I do it like this:
finalTest = test %>%
  mutate(c = paste(a, b))
If I do this with 16 million rows, RAM usage goes from about 2GB to nearly 8GB and the process gets killed by the server (which has 8GB of RAM).
However, if I split the dataset in two, paste the columns, and then rbind, it's fine, even though by doing so I'm creating unnecessary objects (the whole dataset is only about 700MB, so it does make sense that the objects fit in RAM).
test1 = test %>%
  filter(row_number() <= floor(n()/2)) %>%
  mutate(c = paste(a, b))
test2 = test %>%
  filter(row_number() > floor(n()/2)) %>%
  mutate(c = paste(a, b))
finalTest2 = rbind(test1, test2)
This is fine. It seems like the objects fit in memory, but not when you're operating on them. But what's happening that is so memory intensive?
I do not understand at all. Is this expected behaviour? Is it unique to paste? Pasting with strings and dates? Something else?
Upvotes: 4
Views: 394
Reputation: 36
I've been through it too... Once your data frames reach 16M rows, I suggest not bothering to optimise memory usage with dplyr; just go for data.table. It is much faster and more memory efficient. The syntax is more complex, but there are workarounds (below).
Just be aware that data.table generally modifies data by reference (for example when adding a column with :=), whereas dplyr returns modified copies; that's one reason for the performance difference.
Since data.table syntax is, IMHO, hard to pick up at the beginning, you can use the dtplyr package to translate your dplyr code to data.table (use the show_query() function to see the translation), or check this webpage:
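As a rough sketch of what "by reference" means here, using the toy data from the question (this is not a benchmark, and I haven't run it on 16M rows, so treat the memory benefit as an assumption to verify on your own data):

```r
library(data.table)

# Same toy data as in the question
test <- data.frame(a = letters,
                   b = seq.Date(as.Date("2018-01-01"),
                                as.Date("2018-01-26"), "days"))

# as.data.table() makes one copy of the data frame;
# setDT(test) would instead convert it in place
dt <- as.data.table(test)
addr_before <- address(dt)

# := adds column c to dt itself rather than returning a new table
dt[, c := paste(a, b)]

# Compare the object's address before and after the update
same_object <- identical(addr_before, address(dt))
print(same_object)
head(dt, 3)
```

The only new vector allocated by the update is the pasted column itself; the existing columns a and b are left where they are.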
https://atrebas.github.io/post/2019-03-03-datatable-dplyr/
I find it very useful for people familiar with dplyr but not data.table.
If you really want to stick with dplyr, make sure the data.frame you are using was not grouped somewhere earlier in your code; a forgotten grouping sometimes causes surprising behaviour (use ungroup() to remove it).
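A minimal sketch of the dtplyr route, again on the toy data from the question. The names lazy_test and query are just illustrative; lazy_dt() and show_query() are the dtplyr functions doing the work:

```r
library(dplyr)
library(dtplyr)

# Same toy data as in the question
test <- data.frame(a = letters,
                   b = seq.Date(as.Date("2018-01-01"),
                                as.Date("2018-01-26"), "days"))

# Wrap the data in a lazy data.table; nothing is computed yet
lazy_test <- lazy_dt(test)

# Write ordinary dplyr code against the lazy table
query <- lazy_test %>%
  mutate(c = paste(a, b))

# Print the data.table call that dtplyr generated for this pipeline
show_query(query)

# Force the computation and get a regular tibble back
finalTest <- as_tibble(query)
```

Reading the output of show_query() is also a handy way to learn the data.table equivalent of a dplyr pipeline you already know.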
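To check for a leftover grouping before the mutate, something like the following should do. The grouped object is created here only to simulate a group_by() forgotten earlier in a script:

```r
library(dplyr)

# Same toy data as in the question
test <- data.frame(a = letters,
                   b = seq.Date(as.Date("2018-01-01"),
                                as.Date("2018-01-26"), "days"))

# Simulate a grouping left over from an earlier step
grouped <- test %>% group_by(a)

# Inspect whether the data frame is grouped, and by which columns
is_grouped_df(grouped)
group_vars(grouped)

# Drop the grouping before building the new column
finalTest <- grouped %>%
  ungroup() %>%
  mutate(c = paste(a, b))

is_grouped_df(finalTest)
```

On a grouped data frame, mutate() is evaluated once per group, which can be much slower when there are many small groups.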
Upvotes: 1