eb0906
eb0906

Reputation: 85

data wrangling in R - collapse rows

I have data structured like this:

gene sample1 sample2 sample3
1 A 0.2 NA NA
2 B 0.1 NA NA
3 C -0.3 NA NA
4 A NA -0.3 NA
5 B NA 0.5 NA
6 C NA NA NA
7 A NA NA NA
8 B NA NA 0.2
9 C NA NA -0.5

I need to collapse the rows by gene, keeping all values, including NA, so that the final dataframe looks like:

gene sample1 sample2 sample3
1 A 0.2 -0.3 NA
2 B 0.1 0.5 0.2
3 C -0.3 NA -0.5

There are ~400 sample columns and ~5M rows. I've tried the following:

df %>% group_by(gene) %>% summarize(across(everything(), ~ ifelse(all(is.na(.x)), NA, max(.x, na.rm = FALSE))))

But I'm losing some data...any other dplyr solutions (or tweaks to summarize() above)?

Upvotes: 0

Views: 69

Answers (2)

langtang
langtang

Reputation: 24722

library(data.table)
setDT(df)[, lapply(.SD, \(s) max(s[!is.na(s)])), gene]

Output:

     gene sample1 sample2 sample3
   <char>   <num>   <num>   <num>
1:      A     0.2    -0.3      NA
2:      B     0.1     0.5     0.2
3:      C    -0.3      NA    -0.5

Upvotes: 1

TimTeaFan
TimTeaFan

Reputation: 18551

Your approach looks good to me, if you set na.rm = TRUE. Can you provide an example where you actually loose data? I assume that you just want to keep the maximum value, at least, this is what you code suggests.

library(dplyr)

df %>%
  group_by(gene) %>%
  summarize(across(everything(),
                   ~ ifelse(all(is.na(.x)),
                            NA,
                            max(.x, na.rm = TRUE)
                            )
                   )
            )
#> # A tibble: 3 × 4
#>   gene  sample1 sample2 sample3
#>   <chr>   <dbl>   <dbl>   <dbl>
#> 1 A         0.2    -0.3    NA  
#> 2 B         0.1     0.5     0.2
#> 3 C        -0.3    NA      -0.5

Data from OP

df <- tribble(~ gene, ~ sample1, ~ sample2, ~ sample3,

   "A",     0.2     , NA     , NA     ,  
   "B",     0.1     , NA     , NA     ,  
   "C",     -0.3    , NA     , NA     ,  
   "A",     NA      , -0.3   , NA     ,  
   "B",     NA      , 0.5    , NA     ,  
   "C",     NA      , NA     , NA     ,  
   "A",     NA      , NA     , NA     ,  
   "B",     NA      , NA     , 0.2    ,  
   "C",     NA      , NA     , -0.5 )

Created on 2023-03-08 with reprex v2.0.2

Upvotes: 2

Related Questions