Gavin
Gavin

Reputation: 23

Joining Dataframes with Many to Many Relationship in R

I'm trying to join two dataframes in R, each of which has multiple repetitions of the same ID number. This is to be expected within the data and is not a data quality issue.

It was my understanding that I should do a left join between the two dataframes, specify the many to many relationship, and this would work. However, my IDs get duplicated in a way I don't expect.

Here is some test data:

test4 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Score = c(1,2,3,4,5))
test5 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Race = c('W','W','B','B','W'))
test6 <- test4 %>% left_join(test5, by = "ID",  relationship = "many-to-many")

This outputs the data like so:

  ID Score Race
   A     1    W
   A     1    W
   A     2    W
   A     2    W
   B     3    B
   B     3    B
   B     4    B
   B     4    B
   C     5    W

When I want it to output like so:

  ID Score Race
   A     1    W
   A     2    W
   B     3    B
   B     4    B
   C     5    W

I'm willing to bet that I'm missing something simple, but I just can't figure out what. Any advice much appreciated!

Upvotes: 2

Views: 1370

Answers (3)

Friede
Friede

Reputation: 7979

Looking at your sample data, a simple merge seems to be enough:

> merge(unique(test4), unique(test5), by = "ID")
  ID Score Race
1  A     1    W
2  A     2    W
3  B     3    B
4  B     4    B
5  C     5    W

If your sample data does not mimic your data at hand, please adapt accordingly.

Upvotes: 2

Jon Spring
Jon Spring

Reputation: 66880

I'm assuming you'd want to join the first row in ID1 to the first row in ID1 in other tables, etc. That could come in handy, for instance, if you have multiple tables of event data where each event is a row. In that case, you might want to keep all the rows of each table, but make sure they're aligned so they are consistent as to ID and which event corresponding to that ID.

But if there's only one "event" table, and otherwise the tables have duplicate data for each ID, simpler to use the left_join(... multiple = "first") approach @ThomasIsCoding suggested.

left_join(test4 |> mutate(row = row_number(), .by = ID),
          test5 |> mutate(row = row_number(), .by = ID),
          join_by(row, ID))


  ID Score row Race
1  A     1   1    W
2  A     2   2    W
3  B     3   1    B
4  B     4   2    B
5  C     5   1    W

Upvotes: 3

ThomasIsCoding
ThomasIsCoding

Reputation: 102529

you can use distinct before applying left_join, e.g.,

test4 %>%
    left_join(distinct(test5), by = "ID")

or specify multiple = "first" in left_join

test4 %>%
  left_join(test5, by = "ID", multiple = "first")

which gives

  ID Score Race
1 A  1     W   
2 A  2     W   
3 B  3     B   
4 B  4     B   
5 C  5     W   

Upvotes: 2

Related Questions