Ic3MaN911

Reputation: 261

How to do a sequential merge in R based on multiple columns across two datasets

I need to perform a sequential merge in R. Say I have two datasets: orders and deliveries.

I want to match orders to deliveries, first by the address column. For the rows that don't match, I want to merge on zip code; for the rows that still don't match, on latitude and longitude; then on some other attribute, and so forth.

I can easily do a merge based on one attribute like so:

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), sort = FALSE)

But now I want to match the rows that didn't match in merge1 by, say, zip code, which is named differently in the two datasets ("zipcode" in one and "postcode" in the other).

I tried doing a left join on the orders, finding the rows in merge1 that have NA in some column from the deliveries dataset, and then doing another merge on that subset, but I haven't been able to get it to work.

    merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
                    by.y = c("date", "delivery_address"), all.x = TRUE, sort = FALSE)

    merge2 <- merge(merge1[is.na(merge1$delivery_address),], deliveries, by.x = c("order_date", "zipcode"),
                    by.y = c("date", "postcode"), all.x = TRUE, sort = FALSE)

I know that's wrong, as it returns only NA values and duplicates the columns, but that was my train of thought.

Basically, I just want a way to do a sequential merge in R between two datasets, first by one column, then by another, and so on. I don't want a left join but an inner join, where only the matching rows are returned. However, I could do a left join and, after all of the merging, select only the rows that don't have NAs. My final result should be the orders matched up with deliveries, containing only the ones that actually matched.

EDIT:

People asked for some example data, so here is some:

orders <- data.frame( order = c(1,2,3,4,5,6,7,8,9,10),
                      address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
                      zipcode = c(001, 002, 001, 999, 999, 006, 007, 007, 999, 010))

deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12), 
                         delivery_address = c(1111, 1112, 0111, 1113, 1114, 0000, 1618, 0001, 0002, 0405, 1121),
                         postcode = c(001, 912, 001, 910, 913, 006, 080, 007, 074, 088, 010))


merge1 <- merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE)

So merge1 correctly gives me the orders matched with the deliveries that have the same address. How do I now take the orders that didn't get matched and add them to merge1 by matching them on postcode? There are still some orders and deliveries that can be matched that way.

Upvotes: 3

Views: 1352

Answers (2)

Scransom

Reputation: 3335

This works for your example data:

# functions used here use dplyr to process data
library("dplyr")

# using forward pipe syntax for readability of this example
# though this isn't necessary for functions to work
library("magrittr")

# merge by exact matches between address and delivery_address
# add column of delivery_address for binding later so dataframes match
merge1 <- orders %>%
  inner_join(y = deliveries,
             by = c("address" = "delivery_address")) %>%
  mutate(delivery_address = address)

# extract the orders that did not match by address, then merge them
# on exact matches between zipcode and postcode.
# add postcode column for binding
merge2 <- orders %>%
  anti_join(y = deliveries,
            by = c("address" = "delivery_address")) %>%
  inner_join(y = deliveries,
             by = c("zipcode" = "postcode")) %>%
  mutate(postcode = zipcode)

# bind two sets of results together.
results <- bind_rows(merge1, merge2)
results
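
The same pattern can be extended to more than two steps. The following is only a sketch under a few assumptions: `sequential_join`, `keys` and the `matched_on` column are hypothetical names introduced here, and the example data is retyped from the question without the leading zeros, which R drops for numeric literals anyway. Each step inner-joins the orders still unmatched, then removes them with `anti_join()` before the next key is tried.

```r
library(dplyr)

# Example data from the question
orders <- data.frame(
  order   = 1:10,
  address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
  zipcode = c(1, 2, 1, 999, 999, 6, 7, 7, 999, 10)
)
deliveries <- data.frame(
  length           = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
  delivery_address = c(1111, 1112, 111, 1113, 1114, 0, 1618, 1, 2, 405, 1121),
  postcode         = c(1, 912, 1, 910, 913, 6, 80, 7, 74, 88, 10)
)

# One named vector per step, tried in order: c(<orders column> = <deliveries column>)
keys <- list(
  c("address" = "delivery_address"),
  c("zipcode" = "postcode")
)

# Hypothetical helper: join x to y one key set at a time,
# passing only the still-unmatched rows of x to the next step.
sequential_join <- function(x, y, keys) {
  remaining <- x
  matched <- vector("list", length(keys))
  for (i in seq_along(keys)) {
    by <- keys[[i]]
    step <- inner_join(remaining, y, by = by)
    # inner_join() keeps only x's key columns, so copy them back under
    # y's names to give every step the same set of columns
    step[unname(by)] <- step[names(by)]
    # record which key produced the match
    step$matched_on <- paste(names(by), collapse = "+")
    matched[[i]] <- step
    # drop the orders that were matched in this step
    remaining <- anti_join(remaining, y, by = by)
  }
  bind_rows(matched)
}

results <- sequential_join(orders, deliveries, keys)
results
```

Like the code above, this reuses deliveries across steps: a delivery already matched by address can be matched again by postcode. If each delivery should be used at most once, the matched rows would also have to be removed from `y` after each step, which this sketch does not do. Recent dplyr versions may also warn about a many-to-many relationship in the zipcode step, because several orders and deliveries share the same code.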

I highly recommend the RStudio cheat sheets on data transformation for this sort of work.

Upvotes: 3

Parfait

Reputation: 107587

Consider merging on each pair of columns separately, row-binding the results, and then dropping duplicates with unique():

merge1 <- unique(rbind(transform(merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE),
                                 delivery_address = address),
                       transform(merge(orders, deliveries, by.x = "zipcode", by.y = "postcode", sort = FALSE),
                                 postcode = zipcode)))

#    address order zipcode length postcode delivery_address
# 1     1111     1       1      4        1             1111
# 2     1112     2       2      5      912             1112
# 3     1113     4     999     11      910             1113
# 4     1114     5     999     13      913             1114
# 5     1618     6       6     93       80             1618
# 6     1314     3       1      9        1              111
# 7     1314     3       1      4        1             1111
# 8     1111     1       1      9        1              111
# 10    1618     6       6     15        6                0
# 11    1917     7       7     17        7                1
# 12    1118     8       7     17        7                1
# 13    2000    10      10     12       10             1121

For a generalizable solution across multiple columns, use Map() and do.call() with a user-defined function, seqmerge, and extend xvars and yvars with further pairs of merge columns. Make sure both vectors have the same length.

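seqmerge <- function(xvar, yvar) {
  # merge on one pair of columns (xvar from orders, yvar from deliveries)
  df <- merge(orders, deliveries, by.x = xvar, by.y = yvar, sort = FALSE)
  # merge() keeps only the by.x column, so restore the by.y column for rbind
  df[[yvar]] <- df[[xvar]]
  return(df)
}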

xvars <- c("address", "zipcode")               # ADD MORE AS NEEDED
yvars <- c("delivery_address", "postcode")     # ADD MORE AS NEEDED

merge2 <- unique(do.call(rbind, Map(seqmerge, xvars, yvars, USE.NAMES=FALSE)))

all.equal(merge1, merge2)
# [1] TRUE

identical(merge1, merge2)
# [1] TRUE
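
Note that the output above keeps zipcode matches even for orders that already matched by address (order 1 appears in rows 1 and 8). If the later keys should only be tried for orders that are still unmatched, one option is to drop the matched orders before each subsequent merge. A minimal base R sketch, assuming the question's example data (retyped without leading zeros) and a hypothetical helper name `seqmerge_strict`:

```r
# Example data from the question
orders <- data.frame(
  order   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
  zipcode = c(1, 2, 1, 999, 999, 6, 7, 7, 999, 10)
)
deliveries <- data.frame(
  length           = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
  delivery_address = c(1111, 1112, 111, 1113, 1114, 0, 1618, 1, 2, 405, 1121),
  postcode         = c(1, 912, 1, 910, 913, 6, 80, 7, 74, 88, 10)
)

xvars <- c("address", "zipcode")               # columns in orders
yvars <- c("delivery_address", "postcode")     # paired columns in deliveries

# Hypothetical helper: merge on each column pair in turn, but only
# for the rows of x that no earlier pair has matched.
seqmerge_strict <- function(x, y, xvars, yvars) {
  stopifnot(length(xvars) == length(yvars))
  remaining <- x
  out <- vector("list", length(xvars))
  for (i in seq_along(xvars)) {
    step <- merge(remaining, y, by.x = xvars[i], by.y = yvars[i], sort = FALSE)
    # restore the by.y column so all pieces share the same columns
    step[[yvars[i]]] <- step[[xvars[i]]]
    out[[i]] <- step
    # keep only the rows of x whose key has no counterpart in y
    keep <- !(remaining[[xvars[i]]] %in% y[[yvars[i]]])
    remaining <- remaining[keep, , drop = FALSE]
  }
  do.call(rbind, out)
}

merge3 <- seqmerge_strict(orders, deliveries, xvars, yvars)
merge3
```

With the example data, each order matched by address now appears only once, and only the orders left over are matched by zipcode. Deliveries are still reused across steps in this sketch.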

Upvotes: 0
