Reputation: 85
I have data structured like this:
gene | sample1 | sample2 | sample3 | |
---|---|---|---|---|
1 | A | 0.2 | NA | NA |
2 | B | 0.1 | NA | NA |
3 | C | -0.3 | NA | NA |
4 | A | NA | -0.3 | NA |
5 | B | NA | 0.5 | NA |
6 | C | NA | NA | NA |
7 | A | NA | NA | NA |
8 | B | NA | NA | 0.2 |
9 | C | NA | NA | -0.5 |
I need to collapse the rows by gene
, keeping all values, including NA
, so that the final dataframe looks like:
gene | sample1 | sample2 | sample3 | |
---|---|---|---|---|
1 | A | 0.2 | -0.3 | NA |
2 | B | 0.1 | 0.5 | 0.2 |
3 | C | -0.3 | NA | -0.5 |
There are ~400 sample columns and ~5M rows. I've tried the following:
df %>% group_by(gene) %>% summarize(across(everything(), ~ ifelse(all(is.na(.x)), NA, max(.x, na.rm = FALSE))))
But I'm losing some data...any other dplyr solutions (or tweaks to summarize()
above)?
Upvotes: 0
Views: 69
Reputation: 24722
library(data.table)
setDT(df)[, lapply(.SD, \(s) max(s[!is.na(s)])), gene]
Output:
gene sample1 sample2 sample3
<char> <num> <num> <num>
1: A 0.2 -0.3 NA
2: B 0.1 0.5 0.2
3: C -0.3 NA -0.5
Upvotes: 1
Reputation: 18551
Your approach looks good to me, if you set na.rm = TRUE
. Can you provide an example where you actually loose data? I assume that you just want to keep the maximum value, at least, this is what you code suggests.
library(dplyr)
df %>%
group_by(gene) %>%
summarize(across(everything(),
~ ifelse(all(is.na(.x)),
NA,
max(.x, na.rm = TRUE)
)
)
)
#> # A tibble: 3 × 4
#> gene sample1 sample2 sample3
#> <chr> <dbl> <dbl> <dbl>
#> 1 A 0.2 -0.3 NA
#> 2 B 0.1 0.5 0.2
#> 3 C -0.3 NA -0.5
Data from OP
df <- tribble(~ gene, ~ sample1, ~ sample2, ~ sample3,
"A", 0.2 , NA , NA ,
"B", 0.1 , NA , NA ,
"C", -0.3 , NA , NA ,
"A", NA , -0.3 , NA ,
"B", NA , 0.5 , NA ,
"C", NA , NA , NA ,
"A", NA , NA , NA ,
"B", NA , NA , 0.2 ,
"C", NA , NA , -0.5 )
Created on 2023-03-08 with reprex v2.0.2
Upvotes: 2