How to join, group and summarise large dataframes in R with multidplyr and parallel

Question

This question is similar to other problems with very large data in R, but I can't find an example of how to merge/join and then perform calculations on two dfs (as opposed to reading in lots of dataframes and using mclapply to do the calculations). Here the problem is not loading the data (takes ~20 min but they do load), but rather the merging & summarising.

I've tried all data.table approachesI could find, different types of joins, and ff, and I still run into the problem of vecseq limits 2^31 rows. Now I'm trying to use multidplyr to do it in parallel, but can't figure out where the error is coming from.

Dataframes: species_data # df with ~ 65 million rows with cols <- c("id","species_id") lookup # df with ~ 17 million rows with cols <- c("id","cell_id","rgn_id") Not all ids in the lookup appear in the species_data

## make sample dataframes:

lookup <- data.frame(id = seq(2001,2500, by = 1), 
                     cell_id = seq(1,500, by = 1), 
                     rgn_id = seq(801,1300, by = 1))

library(stringi)

species_id <- sprintf("%s%s%s", stri_rand_strings(n = 1000, length = 3, pattern = "[A-Z]"),
                      pattern = "-",
                      stri_rand_strings(1000, length = 5, '[1-9]'))

id <- sprintf("%s%s%s", stri_rand_strings(n = 1000, length = 1, pattern = "[2]"),
                    stri_rand_strings(n = 1000, length = 1, pattern = "[0-4]"),
                    stri_rand_strings(n = 1000, length = 1, pattern = "[0-9]"))

species_data <- data.frame(species_id, id)

merge and join dfs with multidplyr

library(tidyverse)
install.packages("devtools")
library(devtools)
devtools::install_github("hadley/multidplyr") 
library(multidplyr)
library(parallel)

species_summary <- species_data %>%
  # partition the species data by species id
  partition(species_id, cluster = cluster) %>%
  left_join(species_data, lookup, by = "id") %>%
  dplyr::select(-id) %>%
  group_by(species_id) %>%
  ## total number of cells each species occurs in
  mutate(tot_count_cells = n_distinct(cell_id)) %>%
  ungroup() %>%
  dplyr::select(c(cell_id, species_id, rgn_id, tot_count_cells)) %>%
  group_by(rgn_id, species_id) %>% 
  ## number of cells each species occurs in each region
  summarise(count_cells_eez = n_distinct(cell_id)) %>% 
  collect() %>%
  as_tibble()
## Error in partition(., species_id, cluster = cluster) : unused argument (species_id)

## If I change to:
species_summary <- species_data %>%
  group_by(species_id) %>%
  partition(cluster = cluster) %>% ...
## get, "Error in worker_id(data, cluster) : object 'cluster' not found

This is my first attempt at parallel and big data and I'm struggling to diagnose the errors.

Thanks!

How to join, group and summarise large dataframes in R with multidplyr and parallel

Answers (1)

Related Questions