data wrangling in R - collapse rows

Question

I have data structured like this:

	gene	sample1	sample2	sample3
1	A	0.2	NA	NA
2	B	0.1	NA	NA
3	C	-0.3	NA	NA
4	A	NA	-0.3	NA
5	B	NA	0.5	NA
6	C	NA	NA	NA
7	A	NA	NA	NA
8	B	NA	NA	0.2
9	C	NA	NA	-0.5

I need to collapse the rows by gene, keeping all values, including NA, so that the final dataframe looks like:

	gene	sample1	sample2	sample3
1	A	0.2	-0.3	NA
2	B	0.1	0.5	0.2
3	C	-0.3	NA	-0.5

There are ~400 sample columns and ~5M rows. I've tried the following:

df %>% group_by(gene) %>% summarize(across(everything(), ~ ifelse(all(is.na(.x)), NA, max(.x, na.rm = FALSE))))

But I'm losing some data...any other dplyr solutions (or tweaks to summarize() above)?

TimTeaFan · Accepted Answer

Your approach looks good to me, if you set na.rm = TRUE. Can you provide an example where you actually loose data? I assume that you just want to keep the maximum value, at least, this is what you code suggests.

library(dplyr)

df %>%
  group_by(gene) %>%
  summarize(across(everything(),
                   ~ ifelse(all(is.na(.x)),
                            NA,
                            max(.x, na.rm = TRUE)
                            )
                   )
            )
#> # A tibble: 3 × 4
#>   gene  sample1 sample2 sample3
#>            
#> 1 A         0.2    -0.3    NA  
#> 2 B         0.1     0.5     0.2
#> 3 C        -0.3    NA      -0.5

Data from OP

df <- tribble(~ gene, ~ sample1, ~ sample2, ~ sample3,

   "A",     0.2     , NA     , NA     ,  
   "B",     0.1     , NA     , NA     ,  
   "C",     -0.3    , NA     , NA     ,  
   "A",     NA      , -0.3   , NA     ,  
   "B",     NA      , 0.5    , NA     ,  
   "C",     NA      , NA     , NA     ,  
   "A",     NA      , NA     , NA     ,  
   "B",     NA      , NA     , 0.2    ,  
   "C",     NA      , NA     , -0.5 )

^{Created on 2023-03-08 with reprex v2.0.2}

data wrangling in R - collapse rows

Answers (2)

Related Questions