Garp
Garp

Reputation: 419

Merge 2 data frames by multiple columns, keep a row if there is a match in at least one column

I have 2 data frames df_1 and df_2. They share 3 cols in common: permno, cusip, and ticker. Each row of df_1 is a unique stock. The permno, cusip, and ticker in df_1 are used to identify the stock return in df_2. Sometimes one or two of these these variables are not available, but in each row at least one of the three is available. And I will use that value to look up the return in df_2.

Can you suggest any (fast) way to merge df_1 and df_2 if there is a match in at least one of three columns permno, cusip, or ticker.

df_1

id  permno  cusip  ticker
1   1       11     AA
2   NA      12     NA
3   2       13     NA
4   5       NA     NA

df_2

permno  cusip  ticker  return  date
1       11     NA      100     date_1
7       15     BX      102     date_2
2       NA     CU      103     date_3

Desired result

id  permno  cusip  ticker  return  date
1   1       11     AA      100     date_1
1   1       11     NA      100     date_1
3   2       13     NA      103     date_3
3   2       NA     CU      103     date_3

Upvotes: 2

Views: 1188

Answers (1)

Ricky
Ricky

Reputation: 4686

This should work.

# define common columns in both data frames 
colmatch <- c("permno", "cusip", "ticker")

# function to trim down data frame A to just those with rows
# that have at least one match in common column with data frame B
# and append columns from B which are not found in A
simplify <- function(df1, df2, col = colmatch) {
  # find all common column elements that matches
  idx <- sapply(col, function(x)
    match(df1[[x]], df2[[x]], incomparables=NA)
  )

  # find rows in first data frame with at least one match
  idx1 <- which(apply(idx, 1, function(x) !all(is.na(x))))

  # find corresponding rows in second data frame
  idx2 <- apply(idx[idx1, ], 1, function(x) x[min(which(!is.na(x)))])

  # copy columns from second data frame to first data frame
  # only for rows which matches above
  dff <- cbind(df1[idx1, ], df2[idx2, !(names(df2) %in% colmatch), drop=F])
}


# assemble the final output
df_final <- rbind(simplify(df_1, df_2),  # find df_1 rows with matches in df_2
                  simplify(df_2, df_1))  # and vice versa

The final output (if you prefer it sorted by id)

> df_final[order(df_final$id), ]
   id permno cusip ticker return   date
1   1      1    11     AA    100 date_1
11  1      1    11   <NA>    100 date_1
3   3      2    13   <NA>    103 date_3
31  3      2    NA     CU    103 date_3

Upvotes: 1

Related Questions