Learner
Learner

Reputation: 757

how can I combine based on one column and make them unique at the same time

I have many data sets that I want to merge them and make them unique. I am trying to make a representative data here

df1 <- read.table(text="info   var1 var2
1       C001        mytest1       NA
2       C002        mytest2       NA
3       C003  myse1        data1
4       C004        NA       NA
5       C007 where1        India
6       C010 ohio        city
11      C016 number        fifty
12      C017 city        rome", header=T, stringsAsFactors=F)

and this

df2 <- read.table(text="info   var1  var2
1      C003 myse1 data1
2      C007 where1 India
3      C010 ohio city
4      C016 number        fifty
5      C017 city        rome
6      C022 country India
7      C023 number 10", header=T, stringsAsFactors=F)

df3 <- read.table(text="info   var1  var2 var3
1      C017 city        rome  ind
2      C022 country India     bes
3      C027 this  there  NA", header=T, stringsAsFactors=F)

I want to combine them all together bases on info but make them unique. when I want to combine all files, I do this

library(tidyverse)
library(dplyr)
list(df1, df2, df3) %>% reduce(full_join, by = "info")

but I want the output to be like this

info    var1.x  var2.x  var3
C001    mytest1 NA  NA
C002    mytest2 NA  NA
C003    myse1   data1   NA
C004    NA  NA  NA
C007    where1  India   NA
C010    ohio    city    NA
C016    number  fifty   NA
C017    city    rome    ind
C022    country India   bes
C023    number  10  NA
C027    this    there   NA

Upvotes: 0

Views: 45

Answers (2)

hmhensen
hmhensen

Reputation: 3195

I think this should work for you.

bind_rows(df1, df2, df3) %>% 
  unique() %>% 
  mutate(rsum = rowSums(!is.na(.))) %>%
  group_by(info) %>%  
  filter(rsum == max(rsum)) %>% 
  select(-rsum)

   info  var1    var2  var3 
   <chr> <chr>   <chr> <chr>
 1 C001  mytest1 <NA>  <NA> 
 2 C002  mytest2 <NA>  <NA> 
 3 C003  myse1   data1 <NA> 
 4 C004  <NA>    <NA>  <NA> 
 5 C007  where1  India <NA> 
 6 C010  ohio    city  <NA> 
 7 C016  number  fifty <NA> 
 8 C023  number  10    <NA> 
 9 C017  city    rome  ind  
10 C022  country India bes  
11 C027  this    there <NA> 

Upvotes: 1

Soren
Soren

Reputation: 2425

The following solution first generates your unique keys by which you seek to merge your datasets together, the shared "info" column. Then using a left join merge to add the respective columns from var1 in df1 and df2, var2 in df1 and df2, and var3 in df3

library(dplyr)
info <- data.frame(info=unique(c(df1$info,df2$info,df3$info)))
var1s <- unique(rbind(df1[,c("info","var1")],df2[,c("info","var1")],df3[,c("info","var1")]))
var2s <- unique(rbind(df1[,c("info","var2")],df2[,c("info","var2")],df3[,c("info","var2")]))
var3s <- unique(df3[,c("info","var3")])
merge(x=info,y=var1s,by="info",all.x=T) %>% merge(y=var2s,by="info",all.x=T) %>% merge(y=var3s,by="info",all.x=T)

Result:

> merge(x=info,y=var1s,by="info",all.x=T) %>% merge(y=var2s,by="info",all.x=T) %>% merge(y=var3s,by="info",all.x=T)
   info    var1  var2 var3
1  C001 mytest1  <NA> <NA>
2  C002 mytest2  <NA> <NA>
3  C003   myse1 data1 <NA>
4  C004    <NA>  <NA> <NA>
5  C007  where1 India <NA>
6  C010    ohio  city <NA>
7  C016  number fifty <NA>
8  C017    city  rome  ind
9  C022 country India  bes
10 C023  number    10 <NA>
11 C027    this there <NA>

Upvotes: 0

Related Questions