Alex

Reputation: 15718

Is there a way to estimate the amount of memory required when collecting a large table from a database?

I have done some complicated computations inside my database using dplyr, involving one-to-many joins and then filtering the results, so I no longer really know whether my computer's memory is sufficient to collect the result of my computation, e.g.

library(dplyr)
tbl(src_con, 'table_name') %>%
    inner_join(...) %>%
    filter(...) %>%
    inner_join(...) %>% 
    ... %>% # more stuff
    collect(n = Inf) # will this crash the computer?

What is an efficient way of estimating whether the resultant tibble will be too large to fit in memory? I am using dplyr 0.5.0 and PostgreSQL 9.5.

Upvotes: 0

Views: 124

Answers (1)

hrbrmstr

Reputation: 78832

Use compute() to store the query results in a temporary table (it won't send the data back to R), then ask Postgres about that table:

library(dplyr)

db <- src_postgres("postgres", host="localhost", user="bob")

star <- tbl(db, "AllstarFull")

group_by(star, yearID) %>%
  summarise(n=n()) -> res

explain(res) # show the generated SQL and query plan; nothing is pulled into R

res_comp <- compute(res, name="temptbl") # materialise the result as temp table "temptbl" on the db side

tbl(db, sql("SELECT pg_size_pretty(sum(pg_column_size(temptbl))) FROM temptbl")) %>% collect()
## # A tibble: 1 × 1
##   pg_size_pretty
## *          <chr>
## 1     3317 bytes

tbl(db, sql("SELECT pg_size_pretty(sum(pg_column_size(temptbl))/3) FROM temptbl")) %>% collect()
## # A tibble: 1 × 1
##   pg_size_pretty
## *          <chr>
## 1     1105 bytes

object.size(collect(res_comp))
## 1896 bytes

Why divide by 3? pg_column_size() applied to the whole row includes per-row overhead in the result (you'd have to write a query summing the column sizes for each column individually to avoid that). The simple division is really just an estimate of the actual size, but it's fast and good enough for government work.
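
For the record, the per-column version would look something like this (just a sketch against the same temptbl created above; the quoting of "yearID" is an assumption about how dplyr created the temp table's columns):

# Sketch: sum pg_column_size() per column instead of per row, which skips the
# per-row overhead baked into the whole-row call above. Column names (yearID, n)
# are the ones produced by the summarise() earlier; identifier quoting may vary.
tbl(db, sql('SELECT pg_size_pretty(sum(pg_column_size("yearID") + pg_column_size(n))) FROM temptbl')) %>%
  collect()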

You can use this method to get a ballpark estimate of the resultant return size before you collect().
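
Transposed to your pipeline, that would be something along these lines (a sketch only: qry and qry_tmp are placeholder names, and the /3 correction is the same rough adjustment as above):

# Assumes your full dplyr pipeline (the joins/filters in your question) is stored
# in `qry` and `src_con` is the same connection it was built on -- both placeholders.
qry_comp <- compute(qry, name = "qry_tmp")   # materialise as a temp table; nothing returns to R

# Rough size estimate before pulling anything back:
tbl(src_con, sql("SELECT pg_size_pretty(sum(pg_column_size(qry_tmp))/3) FROM qry_tmp")) %>%
  collect()

# If that is comfortably below your free RAM, collect for real:
# res <- collect(qry_comp, n = Inf)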

Upvotes: 1
