Reputation: 15718
I have done some complicated computations inside my database using dplyr, involving one-to-many joins and subsequent filtering, so I no longer really know whether my computer's memory is sufficient to collect the results of my computation, e.g.
library(dplyr)
tbl(src_con, 'table_name') %>%
inner_join(...) %>%
filter(...) %>%
inner_join(...) %>%
... %>% # more stuff
collect(n = Inf) # will this crash the computer?
What is an efficient way of estimating whether the resultant tibble will be too large to fit in memory? I am using dplyr 0.5.0 and PostgreSQL 9.5.
Upvotes: 0
Views: 124
Reputation: 78832
Use compute() to store the query results in a temporary table (it won't send the data back to R), then ask Postgres about that table:
library(dplyr)

# Connect and build a lazy query (nothing is pulled into R yet)
db   <- src_postgres("postgres", host = "localhost", user = "bob")
star <- tbl(db, "AllstarFull")
res  <- star %>% group_by(yearID) %>% summarise(n = n())
explain(res)  # optional: inspect the generated SQL and query plan

# Materialize the result as a temporary table in the database
res_comp <- compute(res, name = "temptbl")

# Ask Postgres how big the stored result is
tbl(db, sql("SELECT pg_size_pretty(sum(pg_column_size(temptbl))) FROM temptbl")) %>% collect()
## # A tibble: 1 × 1
## pg_size_pretty
## * <chr>
## 1 3317 bytes
tbl(db, sql("SELECT pg_size_pretty(sum(pg_column_size(temptbl))/3) FROM temptbl")) %>% collect()
## # A tibble: 1 × 1
## pg_size_pretty
## * <chr>
## 1 1105 bytes
object.size(collect(res_comp))
## 1896 bytes
Why divide by 3? There's overhead in the calculation for pg_column_size() when it's done table-wide, i.e. on the whole row (you'd have to write a query summing the column sizes for each column individually to avoid that; a sketch follows below). The simple division is really just an estimate of the actual size, but it's fast and good enough for government work.
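A per-column version of the size query might look like the sketch below. The column names (yearID, n) are taken from the example result above and are an assumption for your own data; depending on how dplyr quoted the identifiers when it created the temporary table, the mixed-case column may need double quotes in raw SQL.
# Sketch: sum pg_column_size() per column to avoid the per-row overhead.
# Column names (yearID, n) come from the example result above; adjust to yours.
tbl(db, sql(
  'SELECT pg_size_pretty(sum(pg_column_size("yearID")) + sum(pg_column_size(n))) FROM temptbl'
)) %>% collect()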
You can use this method to guesstimate a ball-park figure for the size of the returned result before you collect().
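If you want to automate the decision, one option is to pull back the raw byte count (without pg_size_pretty()) and compare it against a threshold you pick for your machine. The snippet below is only a sketch along those lines, reusing temptbl from above; the 1 GB limit is an arbitrary assumption.
# Sketch: fetch the rough byte estimate as a number, then only collect()
# when it is under a self-chosen limit (1e9 bytes here is an assumption).
est       <- collect(tbl(db, sql("SELECT sum(pg_column_size(temptbl))/3 AS est_bytes FROM temptbl")))
est_bytes <- as.numeric(est$est_bytes)

if (est_bytes < 1e9) {
  res_local <- collect(res_comp, n = Inf)
} else {
  message("Estimated result size is ", est_bytes, " bytes -- probably too big to collect")
}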
Upvotes: 1