Trgovec
Trgovec

Reputation: 575

How does left_join really work? (in R; dplyr)

I have a dataset with Zip codes and corresponding cities. In some cases the zip code is missing, so I want to replace it with a zip code from the zipcode library in R.

Obviously 'New York' has more than one zip code. In my dataset with transactions the same residents appear multiple times, hence also their city e.g. 'New York' appears multiple times.

Using dplyr's left_join function, joining on the city name, I get the corresponding zip codes for the city name 'New York', like so:

10001, 
10002, 
10003, 
etc.

Comparing this to vlookup, Excel would always take the first possible lookup match, in this case 10001.

Based on what logic is here R matching 'New York' with different zip codes in each row?

Upvotes: 0

Views: 1609

Answers (2)

Florian
Florian

Reputation: 25385

A left join will always take all entries in the left table, and add the matches from the right table:

enter image description here

I think the image below shows the logic clearly, you can ignore the SQL statement there.

If you left_join doctors and visits, we will take the left table as starting point, and add all matches from the right table. In this case, the doctor with doctor_id 212 has two matches in the table visits, and thus both vistis are added to the resulting table.

enter image description here

What Excel does is thus not a left join. It just looks for a single reference value and ignores the rest.

If you want to replicate the behavior from Excel you could filter the visits table first, by removing any duplicates in the join column. For example:

visits = data.frame(doctor_id=c('a','b','c','a'),time=c(1,2,3,4))
visits = visits[!duplicated(visits$doctor_id),]

And then use that table for your left join. Hope this helps!

Upvotes: 4

Akarsh Jain
Akarsh Jain

Reputation: 1000

Left join actually merge your "x" tibble with "y" tibble keeping all the entries of x but joining only entries of y (by particular variable) which have matches in x.

you can see here below

dataframe "x"

     rank                        title
    <chr>                        <chr>
 1 2,801.                        Lines
 2 2,802.              Nocno zivljenje
 3 2,803.                 Saber Google
 4 2,804.                   Sacred Vow
 5 2,805. Was hat uns bloß so ruiniert
 6 2,806.                     Our Time
 7 2,807.                     Orecchie
 8 2,808.            Marshrut postroen
 9 2,809.             Shor Se Shuruaat
10 2,810.   Oru Murai Vanthu Paarthaya


dataframe "y"

    rank                 genre
1 2,801.                 Drama
2 2,802. Crime, Drama, Mystery
3 2,803.                Comedy
4 2,804.                 Drama
5 2,805.                Comedy
6 2,806.                 Drama

after applying left_join

left_join(x,y, by ="rank")

"     rank                        title                 genre
    <chr>                        <chr>                 <chr>
 1 2,801.                        Lines                 Drama
 2 2,802.              Nocno zivljenje Crime, Drama, Mystery
 3 2,803.                 Saber Google                Comedy
 4 2,804.                   Sacred Vow                  <NA>
 5 2,805. Was hat uns bloß so ruiniert                  <NA>
 6 2,806.                     Our Time                  <NA>
 7 2,807.                     Orecchie                  <NA>
 8 2,808.            Marshrut postroen                  <NA>
 9 2,809.             Shor Se Shuruaat                  <NA>
10 2,810.   Oru Murai Vanthu Paarthaya                  <NA>"

Upvotes: 2

Related Questions