Reputation: 2022
I have two datasets: inds (individuals) and hhs (households). I'm trying to sum inds$income grouped by UID (unique ID) and create a new column on hhs with the total income of all individuals in a household. Some individuals don't have any income, in which case they have NA for that variable. The code I'm using is:
hhs <- left_join(hhs, inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm=TRUE)))
However, it's extremely slow. inds has over 2 million rows and hhs has about 550k rows. I've used dplyr to average or count similar variables in the same datasets, and it usually takes ten seconds or less. Is there anything I'm doing wrong? Is there any way to make it faster?
Edit: as I said, I've used dplyr to get mean household values without a problem with
hhs <- left_join(hhs, inds %>% filter(AGE > 2) %>% group_by(UID) %>% summarize(L_Bilingual = mean(Bilingual, na.rm=TRUE)))
It takes 5 seconds to calculate the means with the code above. Is there anything particular to the function sum() that makes it slower?
Upvotes: 4
Views: 5045
Reputation: 2022
Here's what was making it slow: I hadn't realized that inds$income had labels:
> head(inds$income)
<Labelled double>: Earned income
[1] 0 0 0 0 258000 0
Labels:
value label
99999998 Unknown/missing.
99999999 NIU (not in universe).
The problem was fixed when I removed the labels (recoding the column with as.numeric()).
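For illustration, here's a minimal sketch of what a labelled column looks like and how to strip it. This assumes the labels come from the haven package (the <Labelled double> printout suggests data imported with read_dta()/read_sav()); the toy values below mirror the head() output above:

```r
library(haven)  # provides labelled vectors, as created by read_dta()/read_sav()

# Reconstruct a small labelled income column like the one printed above
income <- labelled(c(0, 0, 0, 0, 258000, 0),
                   label = "Earned income",
                   labels = c("Unknown/missing." = 99999998,
                              "NIU (not in universe)." = 99999999))

# Dropping the label attributes restores a plain double vector, which
# sum() can process without dispatching through the labelled class
income_num <- as.numeric(income)        # or haven::zap_labels(income)
stopifnot(is.double(income_num), sum(income_num) == 258000)
```

One caveat: the labelled codes 99999998/99999999 are real numeric values standing in for "Unknown/missing" and "NIU", so if any rows actually contain them, you may want to recode those to NA before summing rather than letting them inflate the totals.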
Upvotes: 4
Reputation: 93861
With the fake data below, the summarize-and-join takes about 2 seconds on my machine, which is a new MacBook Pro. Even with a slower machine, it shouldn't take longer than maybe 10 or 15 seconds. Is it possible that the join is creating many more rows than you think? If you provide more details about the structure of your data, we can be more specific about what might be going wrong.
library(tidyverse)
library(microbenchmark)
# Generate two data frames with only the UID column in common
set.seed(2)
hhs = data.frame(UID=1:550000, replicate(30, runif(550000))) %>%
  set_names(c("UID", paste0("V", 1:30)))

inds = data.frame(UID=sample(1:550000, 2e6, replace=TRUE),
                  income=rnorm(2e6, 5e4, 1e4),
                  replicate(20, rnorm(2e6, 5e4, 1e4)))

microbenchmark(join=left_join(hhs, inds %>% group_by(UID) %>%
                                summarize(hhincome = sum(income, na.rm=TRUE))),
               times=5)
Unit: seconds
 expr      min       lq     mean median       uq      max neval
 join 1.924749 1.988773 2.722018 2.0063 2.068044 5.622223     5
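To rule out the row-explosion possibility, you can check the join keys directly. A sketch with toy data (column names assumed from the question): after summarize(), the right-hand table has exactly one row per UID, so the only way left_join() can grow hhs is if hhs itself has duplicated UIDs.

```r
library(dplyr)

# Toy stand-ins for the real tables
hhs  <- tibble(UID = c(1, 2, 3))
inds <- tibble(UID    = c(1, 1, 2, 3, 3, 3),
               income = c(100, 200, NA, 50, NA, 25))

# One row per UID after summarize(), so the join key is unique
summ <- inds %>% group_by(UID) %>% summarize(hhincome = sum(income, na.rm = TRUE))
stopifnot(!any(duplicated(summ$UID)))

# If hhs has unique UIDs too, the result has exactly nrow(hhs) rows
joined <- left_join(hhs, summ, by = "UID")
stopifnot(nrow(joined) == nrow(hhs))
sum(duplicated(hhs$UID))  # anything > 0 here means the join multiplies rows
```

Passing by = "UID" explicitly is also a cheap safeguard: it prevents left_join() from silently matching on every other shared column name.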
Upvotes: 6