Merging complex data.frame rows by a common identifier

Question

example <- structure(list(id = c(1, 1, 2, 2, 3, 3, 4), var1 = c(5, NA, 8, 
9, 10, NA, 3), var2 = c(23, NA, 8, 9, NA, NA, 6), var3 = c(NA, 
NA, NA, NA, NA, NA, NA), var4 = c(5, 5, 6, 6, 7, 7, 8)), row.names = c(NA, 
-7L), class = "data.frame")

I have a large and complex data frame where the same measure may be present more than once, and the different entries for the same measure may contain different data. I have created a simplified example above. What I would like to do is merge this down as much as possible. I think it's easiest for me to explain if I consult the example:

print(example)

  id var1 var2 var3 var4
1  1    5   23   NA    5
2  1   NA   NA   NA    5
3  2    8    8   NA    6
4  2    9    9   NA    6
5  3   10   NA   NA    7
6  3   NA   NA   NA    7
7  4    3    6   NA    8

If a variable is present in one of the rows with the ID, but is NA in another, then the non-NA value is always taken
If the value of the variable is the same across all instances of the ID, use this value
If two rows have the same ID but a different value for a single variable, then they are left as two different rows
If the variable is NA in all rows with some ID, then that ID will have NA for that variable in the final data frame

Thus, the desired output looks like:

  id var1 var2 var3 var4
1  1    5   23   NA    5
2  2    8    8   NA    6
3  2    9    9   NA    6
4  3   10   NA   NA    7
5  4    3    6   NA    8

There are similar questions on StackOverflow, but none of them deal with all of these features simultaneously. The closest I've got to this is:

example %>% 
   group_by(id) %>% 
   summarise_all(funs(list(na.omit(.))))

     id var1      var2      var3      var4     
                  
1     1    
2     2    
3     3    
4     4

Which essentially does what I want, but puts everything into lists, which hinders any downstream work with the dataframe. Instead of getting two rows for ID 2 I get a single row, which contains a list with 8 and 9 for var1, for example. This also occurs for matching variables: for ID1, var4 you get a list containing 5, 5.

Ronak Shah · Accepted Answer

You can do this across. For each column and id, you can keep only unique non-NA values,

library(dplyr)

example %>% 
  group_by(id) %>% 
  summarise(across(.fns = ~{x <- unique(na.omit(.));x[1:max(1, length(x))]}))
  #With summarise_all you can do
  #summarise_all(~{x <- unique(na.omit(.));x[1:max(1, length(x))]})


#    id  var1  var2 var3   var4
#      
#1     1     5    23 NA        5
#2     2     8     8 NA        6
#3     2     9     9 NA        6
#4     3    10    NA NA        7
#5     4     3     6 NA        8

Merging complex data.frame rows by a common identifier

Answers (1)

Related Questions