Werner Hertzog

Reputation: 2022

dplyr: summarize by group (sum) is extremely slow

I have two datasets: inds (individuals) and hhs (households).

I'm trying to sum inds$income grouped by UID (unique ID) and create a new column on hhs with the total income of all individuals in each household. Some individuals don't have any income, in which case they have NA for that variable. The code I'm using is:

hhs <- left_join(hhs, inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm=TRUE)))

However, it's extremely slow. inds has over 2 million rows, hhs has about 550k rows. I've used dplyr to average or count similar variables in the same datasets, and it usually takes ten seconds or less to get this done. Is there anything I'm doing wrong? Is there any way to make it faster?

Edit: as I said, I've used dplyr to get mean household values without a problem with:

hhs <- left_join(hhs, inds %>% filter(AGE > 2) %>% group_by(UID) %>% summarize(L_Bilingual = mean(Bilingual, na.rm=TRUE)))

It takes about 5 seconds to calculate the means with the code above. Is there anything particular to sum() that makes it slower?

Upvotes: 4

Views: 5045

Answers (2)

Werner Hertzog

Reputation: 2022

Here's what was making it slow: I hadn't realized that inds$income had labels:

> head(inds$income)
<Labelled double>: Earned income
[1]      0      0      0      0 258000      0

Labels:
value                  label
 99999998       Unknown/missing.
 99999999 NIU (not in universe).

The problem was fixed when I removed the labels by recoding the column with as.numeric().
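For reference, here is a minimal sketch of the fix, assuming the labels come from haven's labelled class (the values and labels below are made up to mimic the output shown above):

```r
library(haven)  # provides labelled() and zap_labels()

# Hypothetical column mimicking inds$income
income <- labelled(c(0, 0, 0, 0, 258000, 0),
                   labels = c("Unknown/missing." = 99999998,
                              "NIU (not in universe)." = 99999999))

# Dropping the label attributes restores a plain numeric vector
income_num <- as.numeric(income)   # or haven::zap_labels(income)
class(income_num)  # "numeric"
```

Note that this only removes the label metadata; the sentinel values (99999998, 99999999) remain in the data and may need to be recoded to NA before summing.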

Upvotes: 4

eipi10

Reputation: 93861

With the fake data below, the summarize-and-join takes about 2 seconds on my machine, a new MacBook Pro. Even on a slower machine, it shouldn't take longer than maybe 10 or 15 seconds. Is it possible that the join is creating many more rows than you expect? If you provide more details about the structure of your data, we can be more specific about what might be going wrong.

library(tidyverse)
library(microbenchmark)

# Generate two data frames with only the UID column in common
set.seed(2)
hhs = data.frame(UID=1:550000, replicate(30, runif(550000))) %>% 
  set_names(c("UID", paste0("V",1:30)))
inds = data.frame(UID=sample(1:550000, 2e6, replace=TRUE), 
                  income=rnorm(2e6, 5e4, 1e4),
                  replicate(20, rnorm(2e6, 5e4, 1e4)))

microbenchmark(join=left_join(hhs, inds %>% group_by(UID) %>% 
                                summarize(hhincome = sum(income, na.rm=TRUE))),
               times=5)
Unit: seconds
 expr      min       lq     mean median       uq      max neval
 join 1.924749 1.988773 2.722018 2.0063 2.068044 5.622223     5
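To rule out a join blow-up, one quick check is that the summarized table has one row per UID, so the left join returns exactly nrow(hhs) rows. A toy sketch (the hhs/inds here are tiny stand-ins, not the real data):

```r
library(dplyr)

# Toy data: summarize() collapses inds to one row per UID,
# so the left join cannot multiply rows on the inds side.
hhs  <- data.frame(UID = c(1, 2, 3), V1 = c(0.1, 0.2, 0.3))
inds <- data.frame(UID = c(1, 2, 2), income = c(100, 200, NA))

summ <- inds %>%
  group_by(UID) %>%
  summarize(hhincome = sum(income, na.rm = TRUE))

joined <- left_join(hhs, summ, by = "UID")
nrow(joined)  # 3: one row per hhs row
```

If nrow(joined) comes out larger than nrow(hhs), duplicate keys in hhs are inflating the result.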

Upvotes: 6
