How to do a sequential merge in R on multiple columns between two datasets

Question

I need to perform a sequential merge in R. Say I have two datasets: orders and deliveries.

I want to match orders to deliveries, first on the address column; then, for the rows that don't match, on zip code; then, for the rows that still don't match, on latitude and longitude; then on some other attribute, and so forth.

I can easily do a merge based on one attribute like so:

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), sort = FALSE)

But now I want to match the rows that didn't match in merge1 by, say, zip code, which has a different name in each dataset ("zipcode" in one and "postcode" in the other).

I tried doing a left join on the orders, finding the rows that return NA for some column from the deliveries dataset in merge1, and then doing another merge using that subset, but I haven't been able to get that to work:

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), all.x = TRUE, sort = FALSE)

    merge2 <- merge(merge1[is.na(merge1$delivery_address),], deliveries,
                    by.x = c("order_date", "zipcode"),
                    by.y = c("date", "postcode"), all.x = TRUE, sort = FALSE)

I know that's totally wrong as it only returns me NA values and it duplicates the columns, but that was my train of thought.

Basically, I just want a way to do a sequential merge in R between two datasets: first by one column, then by another, and so on. I don't want a left join, though; I want an inner join, where only the matching rows are returned. However, I could do a left join and then, after all of the merging, select only the rows which don't have NAs. So my final result should contain only the orders that were matched to a delivery at some step.

EDIT:

People asked for some example data, so here is some:

    orders <- data.frame(order = c(1,2,3,4,5,6,7,8,9,10),
                         address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                         zipcode = c(001, 002, 001, 999, 999, 006, 007, 007, 999, 010))

    deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
                             delivery_address = c(1111, 1112, 0111, 1113, 1114, 0000, 1618, 0001, 0002, 0405, 1121),
                             postcode = c(001, 912, 001, 910, 913, 006, 080, 007, 074, 088, 010))

    merge1 <- merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE)

So merge1 properly gives me the orders matched up with deliveries that had the same address. Now how do I take the orders that didn't get matched and add them to the merge1 dataset by matching them on postcode instead? There are still some orders and deliveries that can be matched by postcode.

Scransom · Accepted Answer

This works for your example data:

    # functions used here use dplyr to process data
    library("dplyr")

    # using forward pipe syntax for readability of this example
    # though this isn't necessary for functions to work
    library("magrittr")

    # merge by exact matches between address and delivery_address
    # add column of delivery_address for binding later so dataframes match
    merge1 <- orders %>%
      inner_join(y = deliveries,
                 by = c("address" = "delivery_address")) %>%
      mutate(delivery_address = address)

    # extract the rows of orders that were not matched by address,
    # then merge exact matches by zipcode to postcode.
    # add postcode column for binding
    merge2 <- orders %>%
      anti_join(y = deliveries,
                by = c("address" = "delivery_address")) %>%
      inner_join(y = deliveries,
                 by = c("zipcode" = "postcode")) %>%
      mutate(postcode = zipcode)

    # bind two sets of results together.
    results <- bind_rows(merge1, merge2)
    results

I highly recommend the RStudio cheat sheets on data transformation for this sort of work.
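
The two-step pattern above generalises to any number of keys. Here is a minimal sketch in base R. The helper name `sequential_merge`, its `keys` argument, and the `matched_on_step` column are made up for illustration and are not part of any package. The loop tries each key pair in turn. Unlike the dplyr code above, it also removes deliveries that have already been matched, on the assumption that a delivery should not be paired again at a later step; drop that line if that assumption does not fit your data.

```r
# Example data copied from the question.
orders <- data.frame(order = c(1,2,3,4,5,6,7,8,9,10),
                     address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                     zipcode = c(001, 002, 001, 999, 999, 006, 007, 007, 999, 010))
deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
                         delivery_address = c(1111, 1112, 0111, 1113, 1114, 0000, 1618, 0001, 0002, 0405, 1121),
                         postcode = c(001, 912, 001, 910, 913, 006, 080, 007, 074, 088, 010))

# Hypothetical helper: match rows of `x` to rows of `y` one key pair at a time.
# `keys` is a list of named character vectors of the form c(x_column = "y_column").
sequential_merge <- function(x, y, keys) {
  x_left <- seq_len(nrow(x))  # row indices of x not matched yet
  y_left <- seq_len(nrow(y))  # row indices of y not matched yet
  pairs <- data.frame(x_row = integer(0), y_row = integer(0), step = integer(0))

  for (i in seq_along(keys)) {
    x_cols <- names(keys[[i]])
    y_cols <- unname(keys[[i]])
    # Keep only the key columns plus a row index, for unmatched rows only.
    x_keys <- cbind(x[x_left, x_cols, drop = FALSE], x_row = x_left)
    y_keys <- cbind(y[y_left, y_cols, drop = FALSE], y_row = y_left)
    # Inner join on the current key pair.
    hit <- merge(x_keys, y_keys, by.x = x_cols, by.y = y_cols)
    if (nrow(hit) > 0) {
      pairs <- rbind(pairs,
                     data.frame(x_row = hit$x_row, y_row = hit$y_row, step = i))
      x_left <- setdiff(x_left, hit$x_row)
      # Assumption: a matched delivery is not reused at later steps.
      y_left <- setdiff(y_left, hit$y_row)
    }
  }

  # Build the result from the recorded row pairs, so every step
  # contributes the same set of columns.
  pairs <- pairs[order(pairs$x_row, pairs$y_row), ]
  out <- cbind(x[pairs$x_row, , drop = FALSE],
               y[pairs$y_row, , drop = FALSE],
               matched_on_step = pairs$step)
  rownames(out) <- NULL
  out
}

matched <- sequential_merge(
  orders, deliveries,
  keys = list(c(address = "delivery_address"),
              c(zipcode = "postcode"))
)
matched
```

On the example data this pairs five orders by address and four more by zip code; order 9 has no match at either step and is dropped, as in an inner join. Within a single step a delivery can still match several orders (orders 7 and 8 share zip code 7). A multi-column key such as date plus address would be written as `c(order_date = "date", address = "delivery_address")`.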
