llb1706
llb1706

Reputation: 45

Merge two datasets in a many to one framework, where dataset B's columns are a subset of dataset A's

My dataset A is a panel of individuals, as below. Many individuals can belong to a single family (variable fam_id). I have person variables and family variables in this dataset. As you can see, I have values for family variables for 2012 but not for 2013 and 2014.

person_id fam_id year personvar1 personvar2 familyvar1 familyvar 2
1 50 2012 500 stringA 1000 stringW
2 50 2012 550 stringB 1000 stringW
3 60 2012 710 stringC 150 stringX
4 70 2012 800 stringC 200 stringY
5 70 2012 0 stringJ 200 stringY
1 50 2013 120 stringJ NA NA
2 50 2013 370 stringK NA NA
3 60 2013 80 stringL NA NA
1 50 2014 100 stringM NA NA
2 50 2014 500 stringM NA NA
3 60 2014 300 stringO NA NA

Dataset B is a panel of the families in 2013. It has only family variables, which are the same as the ones in dataset A.

fam_id year familyvar1 familyvar2
50 2013 400 stringW
60 2013 700 stringY
70 2013 1000 stringZ

I'd like to 'fill' the NA family variables from 2013 in dataset A with the corresponding values of dataset B.

I've tried

merged_DT <- merge(A, B, by = c("fam_id", "year", "familyvar", "familyvar2") , all.x = TRUE)

But this somehow generates NAs instead of merging approprietely.

I also tried Merging two Dataframes in R by ID, One is the subset of the other approach but I get the error "object '..cols' not found".

How can I do this?

Upvotes: 0

Views: 34

Answers (2)

jay.sf
jay.sf

Reputation: 73572

The problem is to merge unique familyvars to your data. Currently in your main data frame d1, NA is already a level of e.g. familyvar1 in 2003. To get rid of such duplicates, first merge the existing levels to the new family vars data frame d2, and then merge it to d1 while excluding the old family vars. No need to employ extra packages for this.

> merge(d1[!grepl('^familyvar\\d$', names(d1))], 
+       unique(na.omit(merge(d1[c("fam_id", "year", "familyvar1", "familyvar2")], 
+                            d2, all=TRUE))),
+       all.x=TRUE)
   fam_id year person_id personvar1 personvar2 familyvar1 familyvar2
1      50 2012         1        500    stringA       1000    stringW
2      50 2012         2        550    stringB       1000    stringW
3      50 2013         1        120    stringJ        400    stringW
4      50 2013         2        370    stringK        400    stringW
5      50 2014         1        100    stringM         NA       <NA>
6      50 2014         2        500    stringM         NA       <NA>
7      60 2012         3        710    stringC        150    stringX
8      60 2013         3         80    stringL        700    stringY
9      60 2014         3        300    stringO         NA       <NA>
10     70 2012         4        800    stringC        200    stringY
11     70 2012         5          0    stringJ        200    stringY

Data:

> dput(d1)
structure(list(person_id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 
1L, 2L, 3L), fam_id = c(50L, 50L, 60L, 70L, 70L, 50L, 50L, 60L, 
50L, 50L, 60L), year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2013L, 
2013L, 2013L, 2014L, 2014L, 2014L), personvar1 = c(500L, 550L, 
710L, 800L, 0L, 120L, 370L, 80L, 100L, 500L, 300L), personvar2 = c("stringA", 
"stringB", "stringC", "stringC", "stringJ", "stringJ", "stringK", 
"stringL", "stringM", "stringM", "stringO"), familyvar1 = c(1000L, 
1000L, 150L, 200L, 200L, NA, NA, NA, NA, NA, NA), familyvar2 = c("stringW", 
"stringW", "stringX", "stringY", "stringY", NA, NA, NA, NA, NA, 
NA)), class = "data.frame", row.names = c(NA, -11L))
> dput(d2)
structure(list(fam_id = c(50L, 60L, 70L), year = c(2013L, 2013L, 
2013L), familyvar1 = c(400L, 700L, 1000L), familyvar2 = c("stringW", 
"stringY", "stringZ")), class = "data.frame", row.names = c(NA, 
-3L))

Upvotes: 1

Andre Wildberg
Andre Wildberg

Reputation: 19191

An approach that uses coalesce after a left_join to bring together the familyvars. If you include familyvars in the join-by they're not added if they don't match.

library(dplyr)

left_join(A, B, c("fam_id", "year")) %>% 
  mutate(familyvar1 = coalesce(familyvar1.x, familyvar1.y), 
         familyvar2 = coalesce(familyvar2.x, familyvar2.y)) %>% 
  select(-ends_with(c(".x", ".y")))
   person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1          1     50 2012        500    stringA       1000    stringW
2          2     50 2012        550    stringB       1000    stringW
3          3     60 2012        710    stringC        150    stringX
4          4     70 2012        800    stringC        200    stringY
5          5     70 2012          0    stringJ        200    stringY
6          1     50 2013        120    stringJ        400    stringW
7          2     50 2013        370    stringK        400    stringW
8          3     60 2013         80    stringL        700    stringY
9          1     50 2014        100    stringM         NA       <NA>
10         2     50 2014        500    stringM         NA       <NA>
11         3     60 2014        300    stringO         NA       <NA>

Data

A <- structure(list(person_id = c(1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L,
1L, 2L, 3L), fam_id = c(50L, 50L, 60L, 70L, 70L, 50L, 50L, 60L,
50L, 50L, 60L), year = c(2012L, 2012L, 2012L, 2012L, 2012L, 2013L,
2013L, 2013L, 2014L, 2014L, 2014L), personvar1 = c(500L, 550L,
710L, 800L, 0L, 120L, 370L, 80L, 100L, 500L, 300L), personvar2 = c("stringA",
"stringB", "stringC", "stringC", "stringJ", "stringJ", "stringK",
"stringL", "stringM", "stringM", "stringO"), familyvar1 = c(1000L,
1000L, 150L, 200L, 200L, NA, NA, NA, NA, NA, NA), familyvar2 = c("stringW",
"stringW", "stringX", "stringY", "stringY", NA, NA, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -11L))

B <- structure(list(fam_id = c(50L, 60L, 70L), year = c(2013L, 2013L,
2013L), familyvar1 = c(400L, 700L, 1000L), familyvar2 = c("stringW",
"stringY", "stringZ")), class = "data.frame", row.names = c(NA,
-3L))

EDIT, modified to address coalesceing multiple columns, using lapply to cycle through all familyvars and finally remove the initial x/y contributions from A and B, respectively.

res <- left_join(A, B, c("fam_id", "year"))

num_rng <- 1:2 # set to number of familyvars

data.frame(res, 
  setNames(data.frame(lapply(num_rng, \(x) 
    Reduce(coalesce, 
      res[,grep(paste0("familyvar", x), colnames(res), value=T)]))), 
    paste0("familyvar", num_rng))) %>% 
  select(-matches("familyvar\\d+\\.[xy]"))
   person_id fam_id year personvar1 personvar2 familyvar1 familyvar2
1          1     50 2012        500    stringA       1000    stringW
2          2     50 2012        550    stringB       1000    stringW
3          3     60 2012        710    stringC        150    stringX
4          4     70 2012        800    stringC        200    stringY
5          5     70 2012          0    stringJ        200    stringY
6          1     50 2013        120    stringJ        400    stringW
7          2     50 2013        370    stringK        400    stringW
8          3     60 2013         80    stringL        700    stringY
9          1     50 2014        100    stringM         NA       <NA>
10         2     50 2014        500    stringM         NA       <NA>
11         3     60 2014        300    stringO         NA       <NA>

Upvotes: 1

Related Questions