Sol
Sol

Reputation: 754

How to merge 2 data frames filling out missing values in R?

I have been using dplyr and the mergefunction for a while, but I just ran into a task that I can't figure out how to do in a simple way. I would like to merge two data frames. The second data frame has some columns missing. I want the output of merge to have these values filled out with the corresponding values of the first data frame.

The data frames look like this:

#df1
participant = c(1,2,3)
age         = c(18,23,19)
gender      = c(rep("female",2),"male")
experiment  = rep("noun",3)
rt          = c(218,111,546)
df1         = data.frame(participant, age, gender,experiment,rt)

  participant age gender experiment  rt
1           1  18 female       noun 218
2           2  23 female       noun 111
3           3  19   male       noun 546

#df2
participant = c(1,2,3)
experiment  = rep("verb",3)
rt          = c(238,251,140)
df2         = data.frame(participant, experiment,rt)

  participant experiment  rt
1           1       verb 238
2           2       verb 251
3           3       verb 140

This is how the output data frame should look like:

  participant age gender experiment  rt
1           1  18 female       noun 218
2           2  23 female       noun 111
3           3  19   male       noun 546
4           1  18 female       verb 238
5           2  23 female       verb 251
6           3  19   male       verb 140

The goal is that df2 is merged using the column participant, and the values for the columns age and gender are taken from df1. I have tried many things (all the types of join, rbind.fill, etc) and I have also looked around in the forum, but I can't find a simple way to achieve binding the rows of the two data frames and filling out missing values in df2 using a specific column as a reference. In my example, there are only 2 columns to be filled out (age and gender), but my real df1 has many more, so I would like to avoid specifying them by hand.

If any of you data merging wizards have a suggestion, I would be very grateful!

Upvotes: 2

Views: 2381

Answers (1)

akrun
akrun

Reputation: 887961

We place the datasets in a list, use rbindlist with fill=TRUE (from data.table), grouped by 'participant, we assign (:=) the 'age' and 'gender' as the non-NA values in that column.

library(data.table)
rbindlist(list(df1, df2), fill=TRUE)[, c('age', 'gender') :=
       lapply(.SD, na.omit) , participant, .SDcols=age:gender][]
#   participant age gender experiment  rt
#1:           1  18 female       noun 218
#2:           2  23 female       noun 111
#3:           3  19   male       noun 546
#4:           1  18 female       verb 238
#5:           2  23 female       verb 251
#6:           3  19   male       verb 140

Or using only base R, we rbind the subset of columns that are common in both the datasets, and then create the additional columns based on the columns that are found in one dataset and not in other.

i1 <- intersect(names(df1), names(df2))
res <-  rbind(df1[i1], df2[i1])
nm1 <- setdiff(names(df1),i1)
res[nm1] <- df1[nm1]
res
#    participant experiment  rt age gender
#1           1       noun 218  18 female
#2           2       noun 111  23 female
#3           3       noun 546  19   male
#4           1       verb 238  18 female
#5           2       verb 251  23 female
#6           3       verb 140  19   male

Upvotes: 4

Related Questions