Reputation: 23
I have multiple data frames (around 20, here only shown 3 simplified ones) with some overlapping column names. The key variable is "id" and I would like to merge the data frames based on this key variable. There should not be created extra columns. On the other hand I would like to avoid duplicate rows, such that rows with the same key are combined to fill in as much as empty fields as possible while having only one row per key.
I already tried "rbind.fill" but while it fills in correctly the columns, it creates duplicate rows. On the other hand if I try "merge" it gives me back an empty data frame.
df1<- cbind.data.frame(id=c(1,2,3,4), price= c(15,16,20,25), color= c("Black", NA, "White", "Green"), weight= c(5,6,10,12))
df2<- cbind.data.frame(id=c(3,4,5,6), price=c(NA, NA, 23,30), weight=c(10,12,NA, NA), battery= c("low", "high", NA, NA))
df3<- cbind.data.frame(id=c(5,6,7,8), weight= c(NA, 15,17,NA), battery= c("low", "high","high", NA), surface= c(100,115,NA, NA))
df_list <- list(df1,df2,df3)
df5<-Reduce(function(d1, d2) merge(d1, d2, by = "id"),df_list)
library(plyr)
df6 <- rbind.fill(df1,df2,df3)
I Expect the output to be like this data frame:
df4 <- cbind.data.frame(id=c(1,2,3,4,5,6,7,8), price= c(15,16,20,25,23,30,17,NA),color= c("Black", NA, "White", "Green", NA, NA, NA, NA),weight= c(5,6,10,12,NA, 15,NA,NA), battery= c(NA, NA,"low", "high","low", "high","high", NA), surface= c(NA, NA, NA, NA,100,115,NA, NA))
Upvotes: 2
Views: 1033
Reputation: 3080
With dplyr
:
df_list <- list(df1,df2,df3)
library(dplyr)
bind_rows(df_list) %>%
group_by(id) %>%
summarise_all(~first(na.omit(.)))
I'm assuming that non-missing fields will match across the data frames, and therefore just pick the first one observed.
Result
# A tibble: 8 x 6
id price color weight battery surface
<dbl> <dbl> <fct> <dbl> <fct> <dbl>
1 1 15 Black 5 NA NA
2 2 16 NA 6 NA NA
3 3 20 White 10 low NA
4 4 25 Green 12 high NA
5 5 23 NA NA low 100
6 6 30 NA 15 high 115
7 7 NA NA 17 high NA
8 8 NA NA NA NA NA
Upvotes: 3