How does left_join really work? (in R; dplyr)

Question

I have a dataset with zip codes and corresponding cities. In some cases the zip code is missing, so I want to fill it in with a zip code from the zipcode package in R.

Obviously 'New York' has more than one zip code. In my transactions dataset the same residents appear multiple times, so their city (e.g. 'New York') appears multiple times as well.

Using dplyr's left_join function, joining on the city name, I get the corresponding zip codes for the city name 'New York', like so:

10001, 
10002, 
10003, 
etc.

Comparing this to vlookup, Excel would always take the first possible lookup match, in this case 10001.

What logic does R use to match 'New York' with a different zip code in each row?

Florian · Accepted Answer

A left join always keeps every row of the left table and adds all matching rows from the right table.

I think the image below shows the logic clearly; you can ignore the SQL statement there.

If you left_join doctors and visits, the left table is the starting point, and all matches from the right table are added to it. In this case, the doctor with doctor_id 212 has two matches in the table visits, and thus both visits are added to the resulting table, so that doctor appears in two rows.
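To make the row duplication concrete, here is a minimal sketch in dplyr. The two data frames are made up for illustration (only doctor_id 212 with two visits comes from the description above; the names and dates are assumptions).

```r
library(dplyr)

# Hypothetical tables, invented for illustration only
doctors <- data.frame(
  doctor_id = c(210, 212, 213),
  name = c("Dr. A", "Dr. B", "Dr. C")
)
visits <- data.frame(
  doctor_id = c(210, 212, 212),
  visit_date = c("2013-10-13", "2013-10-14", "2013-10-15")
)

# Every row of doctors is kept; each one is paired with ALL of its
# matches in visits, and gets NA where there is no match.
joined <- left_join(doctors, visits, by = "doctor_id")
print(joined)
```

Doctor 212 shows up twice (once per visit), and doctor 213, who has no visits, is kept with a missing visit_date.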

What Excel does is thus not a left join. It just looks for a single reference value and ignores the rest.

If you want to replicate the behavior from Excel you could filter the visits table first, by removing any duplicates in the join column. For example:

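# Example table in which doctor_id 'a' appears twice
visits <- data.frame(doctor_id = c('a', 'b', 'c', 'a'), time = c(1, 2, 3, 4))
# Keep only the first row for each doctor_id, like vlookup's first match
visits <- visits[!duplicated(visits$doctor_id), ]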

And then use that table for your left join. Hope this helps!
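Applied to the zip code case from the question, the same idea might look like the sketch below. The data frames transactions and zip_lookup and their column names are assumptions made for illustration; they are not the actual structure of the zipcode package's data.

```r
library(dplyr)

# Hypothetical transactions with some missing zip codes
transactions <- data.frame(
  resident = c("r1", "r2", "r3"),
  city = c("New York", "New York", "Boston"),
  zip = c(NA, "10003", NA),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table with several zip codes per city
zip_lookup <- data.frame(
  city = c("New York", "New York", "New York", "Boston"),
  zip = c("10001", "10002", "10003", "02108"),
  stringsAsFactors = FALSE
)

# Keep only the first zip code per city, so each city matches exactly once
first_zip <- zip_lookup %>%
  distinct(city, .keep_all = TRUE) %>%
  rename(zip_first = zip)

# Join on city, then fill in only the missing zip codes
filled <- transactions %>%
  left_join(first_zip, by = "city") %>%
  mutate(zip = coalesce(zip, zip_first)) %>%
  select(-zip_first)
print(filled)
```

Because the lookup table now has one row per city, the join no longer multiplies the transaction rows, and coalesce leaves existing zip codes untouched.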
