Reputation: 575
I have a dataset with Zip codes and corresponding cities. In some cases the zip code is missing, so I want to replace it with a zip code from the zipcode library in R.
Obviously 'New York' has more than one zip code. In my dataset with transactions the same residents appear multiple times, hence also their city e.g. 'New York' appears multiple times.
Using dplyr's left_join function, joining on the city name, I get the corresponding zip codes for the city name 'New York', like so:
10001,
10002,
10003,
etc.
Comparing this to vlookup, Excel would always take the first possible lookup match, in this case 10001.
Based on what logic is here R matching 'New York' with different zip codes in each row?
Upvotes: 0
Views: 1609
Reputation: 25385
A left join will always take all entries in the left table, and add the matches from the right table:
I think the image below shows the logic clearly, you can ignore the SQL statement there.
If you left_join doctors
and visits
, we will take the left table as starting point, and add all matches from the right table. In this case, the doctor with doctor_id
212 has two matches in the table visits
, and thus both vistis are added to the resulting table.
What Excel does is thus not a left join. It just looks for a single reference value and ignores the rest.
If you want to replicate the behavior from Excel you could filter the visits table first, by removing any duplicates in the join column. For example:
visits = data.frame(doctor_id=c('a','b','c','a'),time=c(1,2,3,4))
visits = visits[!duplicated(visits$doctor_id),]
And then use that table for your left join. Hope this helps!
Upvotes: 4
Reputation: 1000
Left join actually merge your "x" tibble with "y" tibble keeping all the entries of x but joining only entries of y (by particular variable) which have matches in x.
you can see here below
dataframe "x"
rank title
<chr> <chr>
1 2,801. Lines
2 2,802. Nocno zivljenje
3 2,803. Saber Google
4 2,804. Sacred Vow
5 2,805. Was hat uns bloß so ruiniert
6 2,806. Our Time
7 2,807. Orecchie
8 2,808. Marshrut postroen
9 2,809. Shor Se Shuruaat
10 2,810. Oru Murai Vanthu Paarthaya
dataframe "y"
rank genre
1 2,801. Drama
2 2,802. Crime, Drama, Mystery
3 2,803. Comedy
4 2,804. Drama
5 2,805. Comedy
6 2,806. Drama
after applying left_join
left_join(x,y, by ="rank")
" rank title genre
<chr> <chr> <chr>
1 2,801. Lines Drama
2 2,802. Nocno zivljenje Crime, Drama, Mystery
3 2,803. Saber Google Comedy
4 2,804. Sacred Vow <NA>
5 2,805. Was hat uns bloß so ruiniert <NA>
6 2,806. Our Time <NA>
7 2,807. Orecchie <NA>
8 2,808. Marshrut postroen <NA>
9 2,809. Shor Se Shuruaat <NA>
10 2,810. Oru Murai Vanthu Paarthaya <NA>"
Upvotes: 2