CPak
CPak

Reputation: 13581

combine rows in data frame containing NA to make complete row

I know this is a duplicate Q but I can't seem to find the post again

Using the following data

df <- data.frame(A=c(1,1,2,2),B=c(NA,2,NA,4),C=c(3,NA,NA,5),D=c(NA,2,3,NA),E=c(5,NA,NA,4))

  A  B  C  D  E
  1 NA  3 NA  5
  1  2 NA  2 NA
  2 NA NA  3 NA
  2  4  5 NA  4

Grouping by A, I'd like the following output using a tidyverse solution

  A  B  C  D  E
  1  2  3  2  5
  2  4  5  3  4

I have many groups in A. I think I saw an answer using coalesce but am unsure how to get it work. I'd like a solution that works with characters as well. Thanks!

Upvotes: 34

Views: 13406

Answers (7)

AndrewGB
AndrewGB

Reputation: 16856

A simple way is to summarise by taking the max across all columns for each group with tidyverse:

library(tidyverse)

df %>% 
  group_by(A) %>% 
  summarise(across(everything(), ~ max(., na.rm = T)))

Output

      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4

Another option with using na.omit and unique:

df %>% 
  group_by(A) %>% 
  summarise(across(everything(), ~ na.omit(unique(.))))

Upvotes: 2

www
www

Reputation: 39154

We can use fill to fill all the missing values. And then filter just one row for each group.

library(dplyr)
library(tidyr)

df2 <- df %>%
  group_by(A) %>%
  fill(everything(), .direction = "down") %>%
  fill(everything(), .direction = "up") %>%
  slice(1)

And thanks to @Roger-123, the above code can be further simplified as follows.

df2 <- df %>%
  group_by(A) %>%
  fill(everything(), .direction = "downup") %>%
  slice(1)

Upvotes: 19

Marcus Lehr
Marcus Lehr

Reputation: 57

This is functionally identical to @Oriol Mirosa's answer without requiring a custom function:

EDIT: NAs must be omitted as per @thelatemail's comment. This answer was also given by @MrFlick in the duplicate thread linked above.

df %>% group_by(A) %>% summarise_all(~first(na.omit(.)))

I wanted to add to this as it seems to come up regularly for me and I've revisited this thread many times. @Oriol Mirosa's answer works, however I'm resistant to it because it's just complex enough to be difficult to remember (hence my return to this thread).

Personally, I also don't like writing small custom functions like if I don't need to. Attempting to substitute coalesce_by_column with the actual coalesce call results in type errors (which I find strange as the rows aren't interacting with each other but whatever). This can be resolved by first doing mutate_all(as.character), however my goal here is to minimize syntax so it's easily remembered on the fly.

Furthermore, this substitution changes the behavior such that non-identical values within a column throws an error (why things sometimes behave slightly differently within a function is beyond me). This behavior may be preferred in some situations, however in that case I would recommend @Jerry T's solution as there is no custom function and the ones used are familiar, readable, and the ordering of them (na.omit and unique) isn't relevant.

Upvotes: 1

tmfmnk
tmfmnk

Reputation: 39858

A different tidyverse possibility could be:

df %>%
 gather(var, val, -A, na.rm = TRUE) %>%
 group_by(A, var) %>%
 distinct(val) %>%
 spread(var, val)

      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4

Here it, first, performs a wide-to-long data-transformation, excluding the "A" column and removing the missing values. Second, it groups by "A" column and the variable names. Third, it removes the duplicate values. Finally, it returns the data to its original wide format.

Upvotes: 1

Oriol Mirosa
Oriol Mirosa

Reputation: 2826

I haven't figured out how to put the coalesce_by_column function inside the dplyr pipeline, but this works:

coalesce_by_column <- function(df) {
  return(coalesce(df[1], df[2]))
}

df %>%
  group_by(A) %>%
  summarise_all(coalesce_by_column)

##       A     B     C     D     E
##   <dbl> <dbl> <dbl> <dbl> <dbl>
## 1     1     2     3     2     5
## 2     2     4     5     3     4

Edit: include @Jon Harmon's solution for more than 2 members of a group

# Supply lists by splicing them into dots:
coalesce_by_column <- function(df) {
  return(dplyr::coalesce(!!! as.list(df)))
}

df %>%
  group_by(A) %>%
  summarise_all(coalesce_by_column)

#> # A tibble: 2 x 5
#>       A     B     C     D     E
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     2     5
#> 2     2     4     5     3     4

Upvotes: 25

Jerry T
Jerry T

Reputation: 1690

Here is an even more general solution (using unique, na.omit to sort of create coalesce), which can handle more than two rows with overlapping information. Super simply and forward.

> df <- data.frame(A=c(1,1,2,2,2),B=c(NA,2,NA,4,4),C=c(3,NA,NA,5,NA),D=c(NA,2,3,NA,NA),E=c(5,NA,NA,4,4))

> df
  A  B  C  D  E
1 1 NA  3 NA  5
2 1  2 NA  2 NA
3 2 NA NA  3 NA
4 2  4  5 NA  4
5 2  4 NA NA  4

> df %>% group_by(A) %>% summarise_all(funs( na.omit(unique(.)) ))
# A tibble: 2 x 5
      A     B     C     D     E
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     2     3     2     5
2     2     4     5     3     4

Upvotes: 4

d.b
d.b

Reputation: 32548

Not tidyverse but here's one base R solution

df <- data.frame(A=c(1,1),B=c(NA,2),C=c(3,NA),D=c(NA,2),E=c(5,NA))
sapply(df, function(x) x[!is.na(x)][1])
#A B C D E 
#1 2 3 2 5 

With updated data

do.call(rbind, lapply(split(df, df$A), function(a) sapply(a, function(x) x[!is.na(x)][1])))
#  A B C D E
#1 1 2 3 2 5
#2 2 4 5 3 4

Upvotes: 5

Related Questions