Jacob
Jacob

Reputation: 329

R Multiple Dataframe Column Matches to Populate Column

I have a dataframe 'df1' that looks like:

structure(list(MAPS_code = c("SARI", "SABO", "SABO", "SABO", 
"ISLA", "TROP"), Location_code = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-",
"LCP-"), Contact = c("Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall"), Lat = c(NA, NA, NA, 
NA, NA, "51.23"), Long = c(NA, NA, NA, NA, NA, "-109.26")), row.names = c(NA, 6L), class = "data.frame")

And a second dataframe 'df2' that looks like:

structure(list(MAPS_code = c("SAFR", "SAGA", "ELPU", "ISLA", 
"SABO", "SATE", "QUST", "SARI", "PANA", "COPA", "LOAN", "GAPA", 
"MELI", "CAGO", "PINO", "GABO", "RIJA", "FILA", "AMIS"), Lat = c(8.765833, 
8.751389, 8.768611, 8.835833, 8.801111, 8.808333, 8.815, 8.827778, 
8.781667, 8.778333, 8.783333, 8.800833, 8.790278, 8.754444, 8.844444, 
8.801389, 8.786667, 8.785278, 8.952222), Long = c(-82.94277, 
-82.951111, -82.95, -82.963056, -82.917222, -82.924444, -82.923889, 
-82.924167, -82.896944, -82.955833, -82.938611, -82.972222, -82.967222, 
-82.925833, -82.97, -82.972222, -82.964722, -82.976111, -82.833333
), Contact = c("Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall"
), Location = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-", "LCP-", 
"LCP-", "LCP-", "LCP-", "LCP-", "LCP-", "LCP-", "LCP-", "LCP-", 
"LCP-", "LCP-", "LCP-", "LCP-", "LCP-")), class = "data.frame", row.names = c(NA, 
-19L))

How can I make populate each row of 'Lat' and 'Long' of df1 from 'Lat' and 'Long' of df2 when 'Contact', 'Location', and 'MAPS_code' of the corresponding rows match between df1 and df2? So that the result of df1 looks like:

structure(list(MAPS_code = c("SARI", "SABO", "SABO", "SABO", 
"ISLA", "TROP"), Location_code = c("LCP-", "LCP-", "LCP-", "LCP-", "LCP-", 
"LCP-"), Contact = c("Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall", 
"Chase Mendenhall", "Chase Mendenhall", "Chase Mendenhall"), Lat = c("8.827778", "8.801111", "8.801111
", "8.801111", "8.835833", "51.23"), Long = c("-82.92417", "-82.91722", "-82.91722", "-82.91722", "-82.96306", "-109.26")), row.names = c(NA, 6L), class = "data.frame")

Note, if there are already data in Lat and Long, I don't want them deleted or written over with NAs.

Upvotes: 3

Views: 106

Answers (2)

Anoushiravan R
Anoushiravan R

Reputation: 21908

Updated Answer We could use dplyr::coalesce to retrieve the values which is not NA among the two pairs of Lat and Long:

library(dplyr)

df1 %>%
  rename(Location = Location_code) %>%
  left_join(df2, by = c('MAPS_code', 'Contact', 'Location')) %>%
  mutate(across(ends_with('.x'), as.double)) %>%
  mutate(Lat = coalesce(!!!(select(., starts_with('Lat')))), 
         Long = coalesce(!!!select(., starts_with('Long')))) %>%
  select(!contains('.'))


  MAPS_code Location          Contact       Lat       Long
1      SARI     LCP- Chase Mendenhall  8.827778  -82.92417
2      SABO     LCP- Chase Mendenhall  8.801111  -82.91722
3      SABO     LCP- Chase Mendenhall  8.801111  -82.91722
4      SABO     LCP- Chase Mendenhall  8.801111  -82.91722
5      ISLA     LCP- Chase Mendenhall  8.835833  -82.96306
6      TROP     LCP-        Tom Jones 51.230000 -109.26000

Upvotes: 3

divibisan
divibisan

Reputation: 12155

This is very similar to @Anoushiravan R's answer, but I'd suggest removing unnecessary variables before joining, rather than wrestling with the variable names generated by duplicate columns:

df1 %>%
    select(-Lat, -Long) %>%
    left_join(df2, by=c('MAPS_code', 'Location_code'='Location', 'Contact'))


  MAPS_code Location_code          Contact      Lat      Long
1      SARI          LCP- Chase Mendenhall 8.827778 -82.92417
2      SABO          LCP- Chase Mendenhall 8.801111 -82.91722
3      SABO          LCP- Chase Mendenhall 8.801111 -82.91722
4      SABO          LCP- Chase Mendenhall 8.801111 -82.91722
5      ISLA          LCP- Chase Mendenhall 8.835833 -82.96306

Passing a named vector element to the by= argument (for example 'Location_code'='Location') lets you join on non-matching variable names. The name on the left side of the = is the name in the first data.frame (or the one piped in), while the name on the right side is the name in the second data.frame.

Also, note that this will add the new variables onto the end (right) of the joined data frame. If you wanted to match the order of variables in df1 (that wouldn't apply here, since the order is the same), you can use select to reorder the variables:

... %>%
    select(one_of(names(df1)), everything())

names(df1) gets the variable names from df1 in order. By putting it in one_of, a variable missing in the final data.frame will give a warning instead of an error. Ending with everything, will ensure that no variables are lost, and any variable not in df1 will be stuck on the end.

Upvotes: 3

Related Questions