Merge two datasets in a many to one framework, where dataset B's columns are a subset of dataset A's

Question

My dataset A is a panel of individuals, as below. Many individuals can belong to a single family (variable fam_id). I have person variables and family variables in this dataset. As you can see, I have values for family variables for 2012 but not for 2013 and 2014.

person_id	fam_id	year	personvar1	personvar2	familyvar1	familyvar 2
1	50	2012	500	stringA	1000	stringW
2	50	2012	550	stringB	1000	stringW
3	60	2012	710	stringC	150	stringX
4	70	2012	800	stringC	200	stringY
5	70	2012	0	stringJ	200	stringY
1	50	2013	120	stringJ	NA	NA
2	50	2013	370	stringK	NA	NA
3	60	2013	80	stringL	NA	NA
1	50	2014	100	stringM	NA	NA
2	50	2014	500	stringM	NA	NA
3	60	2014	300	stringO	NA	NA

Dataset B is a panel of the families in 2013. It has only family variables, which are the same as the ones in dataset A.

fam_id	year	familyvar1	familyvar2
50	2013	400	stringW
60	2013	700	stringY
70	2013	1000	stringZ

I'd like to 'fill' the NA family variables from 2013 in dataset A with the corresponding values of dataset B.

I've tried

merged_DT <- merge(A, B, by = c("fam_id", "year", "familyvar", "familyvar2") , all.x = TRUE)

But this somehow generates NAs instead of merging approprietely.

I also tried Merging two Dataframes in R by ID, One is the subset of the other approach but I get the error "object '..cols' not found".

How can I do this?

Andre Wildberg · Accepted Answer

An approach that uses coalesce after a left_join to bring together the familyvars. If you include familyvars in the join-by they're not added if they don't match.

library(dplyr)

left_join(A, B, c("fam_id", "year")) %>% 
  mutate(familyvar1 = coalesce(familyvar1.x, familyvar1.y), 
         familyvar2 = coalesce(familyvar2.x, familyvar2.y)) %>% 
  select(-ends_with(c(".x", ".y")))
   person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1          1     50 2012        500    stringA       1000    stringW
2          2     50 2012        550    stringB       1000    stringW
3          3     60 2012        710    stringC        150    stringX
4          4     70 2012        800    stringC        200    stringY
5          5     70 2012          0    stringJ        200    stringY
6          1     50 2013        120    stringJ        400    stringW
7          2     50 2013        370    stringK        400    stringW
8          3     60 2013         80    stringL        700    stringY
9          1     50 2014        100    stringM         NA       
10         2     50 2014        500    stringM         NA       
11         3     60 2014        300    stringO         NA

Data

A <- structure(list(person_id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
1L, 2L, 3L), fam_id = c(50L, 50L, 60L, 70L, 70L, 50L, 50L, 60L,
50L, 50L, 60L), year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2013L,
2013L, 2013L, 2014L, 2014L, 2014L), personvar1 = c(500L, 550L,
710L, 800L, 0L, 120L, 370L, 80L, 100L, 500L, 300L), personvar2 = c("stringA",
"stringB", "stringC", "stringC", "stringJ", "stringJ", "stringK",
"stringL", "stringM", "stringM", "stringO"), familyvar1 = c(1000L,
1000L, 150L, 200L, 200L, NA, NA, NA, NA, NA, NA), familyvar2 = c("stringW",
"stringW", "stringX", "stringY", "stringY", NA, NA, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -11L))

B <- structure(list(fam_id = c(50L, 60L, 70L), year = c(2013L, 2013L,
2013L), familyvar1 = c(400L, 700L, 1000L), familyvar2 = c("stringW",
"stringY", "stringZ")), class = "data.frame", row.names = c(NA,
-3L))

EDIT, modified to address coalesceing multiple columns, using lapply to cycle through all familyvars and finally remove the initial x/y contributions from A and B, respectively.

res <- left_join(A, B, c("fam_id", "year"))

num_rng <- 1:2 # set to number of familyvars

data.frame(res, 
  setNames(data.frame(lapply(num_rng, \(x) 
    Reduce(coalesce, 
      res[,grep(paste0("familyvar", x), colnames(res), value=T)]))), 
    paste0("familyvar", num_rng))) %>% 
  select(-matches("familyvar\d+\.[xy]"))
   person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1          1     50 2012        500    stringA       1000    stringW
2          2     50 2012        550    stringB       1000    stringW
3          3     60 2012        710    stringC        150    stringX
4          4     70 2012        800    stringC        200    stringY
5          5     70 2012          0    stringJ        200    stringY
6          1     50 2013        120    stringJ        400    stringW
7          2     50 2013        370    stringK        400    stringW
8          3     60 2013         80    stringL        700    stringY
9          1     50 2014        100    stringM         NA       
10         2     50 2014        500    stringM         NA       
11         3     60 2014        300    stringO         NA

Merge two datasets in a many to one framework, where dataset B's columns are a subset of dataset A's

Answers (2)

Data

Related Questions

Merge two datasets in a many to one framework, where dataset B&#39;s columns are a subset of dataset A&#39;s

Answers (2)

Data

Related Questions

Merge two datasets in a many to one framework, where dataset B's columns are a subset of dataset A's