Reputation: 681
If I have the following:
library(dplyr)

x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

z <- x %>%
  left_join(y, by = c("Row_Index" = "Row_Index")) %>%
  group_by(Row_Index, Name, Age) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount))
It produces a nice joined table where for each individual in x I can see their Name, Age and Total_Claim_Amount. All ok.
It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then they won't appear in the resulting table, which isn't what I want.
In a real-life example I'm doing exactly the same type of lookup, but with many more fields: my left-join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement, and it's overwhelming my PC.
Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables out of the group_by() statement and into the summarise() statement, preceded by a first() or something like that?
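For illustration, something along these lines, using the toy data above (just a sketch of the first() idea, not something I've benchmarked):

z <- x %>%
  left_join(y, by = c("Row_Index" = "Row_Index")) %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )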
Thank you.
Upvotes: 1
Views: 1381
Reputation: 145775
z <- y %>%
  group_by(Row_Index) %>%
  summarize(...) %>%   # your 16 summary expressions go here
  right_join(x, by = "Row_Index")
# same result, much more efficiently.
In your example, the join adds a bunch of columns to y, 55 columns with lots of repeated information. Grouping by and summarizing all those columns means R has to go through every single column and make sure there aren't any mismatches with Row_Index that would require the creation of a new group. You know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarize, and then do the join to add the contextual information for each Row_Index. This should be much faster, and the advantage grows with the number of columns.
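Applied to the toy data in the question, a minimal version of this approach (assuming the claim total is the only summary needed) looks like:

library(dplyr)

z <- y %>%
  group_by(Row_Index) %>%
  summarize(Total_Claim_Amount = sum(Claim_Amount)) %>%
  right_join(x, by = "Row_Index")

As with the original code, individuals with no claims (Row_Index 1 and 5) get NA for Total_Claim_Amount; the result also carries City along, since nothing from x is dropped.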
If you want additional speed you could switch to data.table (a rough equivalent is sketched below), but my guess is that the dplyr change alone will adequately solve your speed problem.
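For reference, a rough data.table sketch of the same group-then-join idea (untested beyond the toy data) might look like:

library(data.table)

setDT(x)
setDT(y)

# summarize claims per Row_Index, then join the totals onto x, keeping every row of x
z <- y[, .(Total_Claim_Amount = sum(Claim_Amount)), by = Row_Index][x, on = "Row_Index"]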
Upvotes: 3