gcbrianbrown
gcbrianbrown

Reputation: 45

R: How to join dataframes with a key that has duplicate values, and COPY the second frame's values to each duplicate?

I'm trying to join two dataframes. One is home sale data and the other is distance from a railroad. I am trying to join the data using a parcel ID as a key. The parcel ID in the sales data will have multiple values, because the parcels have sold several times. I want the data to join so that the distance values will repeat for each instance of the parcel ID, something like this.

Dataframe A:

ID Sale Date Sale Price
123 1/1/2020 320000
123 2/1/2021 429000
124 3/1/2019 190000
124 2/13/2020 280000
124 1/1/2022 419000
125 2/1/2021 300000
125 1/1/2022 390000
126 2/1/2021 310000

Dataframe B:

ID Distance
123 1290
124 1809
125 370
126 976

I want the joined dataframe to look like this:

ID Sale Date Sale Price Distance
123 1/1/2020 320000 1290
123 2/1/2021 429000 1290
124 3/1/2019 190000 1809
124 2/13/2020 280000 1809
124 1/1/2022 419000 1809
125 2/1/2021 300000 370
125 1/1/2022 390000 370
126 2/1/2021 310000 976

The distance values are copied for each parcel ID. When I use left_join I'm getting a new dataframe with more rows than either of the original dataframes (which is one problem) with a lot of NA values for distance (which is another problem). I don't understand why my new dataframe is larger than the largest of the original dataframes, and I don't know how to make it so that distance is repeated across each instance of the parcel ID and not just copied once and then NA thereafter.

When I searched for a solution for this problem, I mostly found methods of dropping duplicate rows, which I don't want. I want the rows in dataframe B to be copied for each instance of the ID in dataframe A.

Upvotes: 2

Views: 862

Answers (1)

Quinten
Quinten

Reputation: 41225

You can use left_join by your ID from the dplyr package. You can use the following code:

df_A <- data.frame(ID = c(123, 123, 124, 124, 124, 125, 125, 126),
                   Sale_Date = c("1/1/2020", "2/1/2021", "3/1/2019", "2/13/2020", "1/1/2022", "2/1/2021", "1/1/2021", "2/1/2021"),
                   Sale_Price = c(320000, 429000, 190000, 280000, 419000, 300000, 390000, 310000))

df_B <- data.frame(ID = c(123, 124, 125, 126),
                   Distance = c(1290, 1809, 370, 976))

library(dplyr)
df_joined <- left_join(df_A, df_B, by = "ID")
df_joined

Output:

   ID Sale_Date Sale_Price Distance
1 123  1/1/2020     320000     1290
2 123  2/1/2021     429000     1290
3 124  3/1/2019     190000     1809
4 124 2/13/2020     280000     1809
5 124  1/1/2022     419000     1809
6 125  2/1/2021     300000      370
7 125  1/1/2021     390000      370
8 126  2/1/2021     310000      976

Upvotes: 2

Related Questions