Alan

Reputation: 681

R - dplyr left_join() - Multiple Matches - How to Recombine...?

If I have the following:

x <- data.frame(
       Row_Index = 1:5,
       Name = c("Alan", "Bob", "Charles", "David", "Eric"),
       Age = c(49, 23, 44, 52, 18),
       City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)

y <- data.frame(
       Claim_Reference = 1:6,
       Row_Index = c(3, 2, 2, 4, 6, 4),
       Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

z <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index, Name, Age) %>%
  summarise(Total_Claim_Amount = sum(Claim_Amount))

it produces a nice joined table where for each individual in x I can see their Name, Age and Total_Claim_Amount. All ok.

It would be sufficient for grouping purposes to use Row_Index alone in the group_by() statement and skip Name and Age, but then they won't appear in the resulting table, which isn't what I want.

In a real-life example, I'm doing exactly the same type of lookup, but with many more fields: my left-join query has 55 variables inside the group_by() statement and 16 variables inside the summarise() statement. It's overwhelming my PC.

Is there a more efficient way to do this? It's something I need to do quite often. Should I, for example, move the "redundant" variables from the group_by() statement into the summarise() statement, preceded by first() or something like that?
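For reference, the first() idea described above would look like the sketch below, using the toy x and y from the top of the question. It assumes Name and Age are constant within each Row_Index (which they are here, since Row_Index is the key of x):

```r
library(dplyr)

# Toy data as defined in the question.
x <- data.frame(
  Row_Index = 1:5,
  Name = c("Alan", "Bob", "Charles", "David", "Eric"),
  Age = c(49, 23, 44, 52, 18),
  City = c("London", "Paris", "Berlin", "Moscow", "Tokyo")
)
y <- data.frame(
  Claim_Reference = 1:6,
  Row_Index = c(3, 2, 2, 4, 6, 4),
  Claim_Amount = c(100, 1000, 500, 200, 300, 5000)
)

# Group by the key alone; carry the other columns through with first().
z <- x %>%
  left_join(y, by = "Row_Index") %>%
  group_by(Row_Index) %>%
  summarise(
    Name = first(Name),
    Age = first(Age),
    Total_Claim_Amount = sum(Claim_Amount)
  )
```

Note that individuals with no claims (Row_Index 1 and 5) get NA rather than 0, because sum() over an all-NA group of joined claim amounts returns NA.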

Thank you.

Upvotes: 1

Views: 1381

Answers (1)

Gregor Thomas

Reputation: 145775

z <- y %>% 
  group_by(Row_Index) %>%
  summarize(Total_Claim_Amount = sum(Claim_Amount)) %>% 
  right_join(x, by = "Row_Index")
# same result, much more efficiently

In your real example, the join adds a bunch of columns to y, 55 of them, with lots of repeated information. Grouping by and summarizing all those columns means R has to scan every single column to make sure there are no mismatches within a Row_Index that would require creating a new group. You know that each Row_Index defines a group, so you should tell R to group only by Row_Index, do your summarize, and then do the join to add the contextual information for each Row_Index. This should get dramatically faster relative to your version as the number of columns grows.

If you want additional speed, you could switch to data.table, but my guess is this will adequately solve your speed problem.

Upvotes: 3
