dplyr: summarize by group (sum) is extremely slow

Question

I have two datasets: inds (individuals) and hhs (households).

I'm trying to sum all inds$income grouped by UID (unique ID) and create a new column on hhs with the sum of the income of all individuals in a households. Some individuals don't have any income, in which case they have 'NA' for that variable. The code I'm using is:

hhs <- left_join(hhs, inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm=TRUE)))

However, it's extremely slow. inds has over 2 million rows, hhs has about 550k rows. I've used dplyr to average or count similar variables in the same datasets, and it usually takes ten seconds or less to get this done. Is there anything I'm doing wrong? Is there any way to make it faster?

Edit: as I said, I've used dplyer to get mean household values without a problem with

hhs <- left_join(hhs, inds %>% filter(AGE > 2) %>% group_by(UID) %>% summarize(L_Bilingual = mean(Bilingual, na.rm=TRUE)))

It takes 5 seconds to calculate the means with the code above. Is there anything particular to the function sum() that makes it slower?

eipi10 · Accepted Answer

With the fake data below, the summarize-and-join takes about 2 seconds on my machine, which is a new Macbook Pro. Even with a slower machine, it shouldn't take longer than maybe 10 or 15 seconds. Is it possible that the join is creating many more rows than you think? If you provide more details about the structure of your data, we can be more specific about what might be going wrong.

library(tidyverse)
library(microbenchmark)

# Generate two data frames with only the UID column in common
set.seed(2)
hhs = data.frame(UID=1:550000, replicate(30, runif(550000))) %>% 
  set_names(c("UID", paste0("V",1:30)))
inds = data.frame(UID=sample(1:550000, 2e6, replace=TRUE), 
                  income=rnorm(2e6, 5e4, 1e4),
                  replicate(20, rnorm(2e6, 5e4, 1e4)))

microbenchmark(join=left_join(hhs, inds %>% group_by(UID) %>% 
                                summarize(hhincome = sum(income, na.rm=TRUE))),
               times=5)

Unit: seconds
 expr      min       lq     mean median       uq      max neval
 join 1.924749 1.988773 2.722018 2.0063 2.068044 5.622223     5

dplyr: summarize by group (sum) is extremely slow

Answers (2)

Related Questions