Reputation: 107
I have a data frame that contain 500 rows and 20000 columns. rows contain different sample IDs and It have duplicate sample ID in rows but different column values. My data frame is like this .....
sample_name E002.c1 E004.c1 E005.c1 E007.c1 so on...
T4456-C 0 0 0 0
T4456-C 1 0 0 1
T4456-C 1 0 1 1
T9589-C 0 1 0 0
T9589-C 1 1 0 0
Is there is a way to merge these ID like
If column contain 0 for all then consider value as 0 If column contain at least one 1 then consider value as 1.
Expected output :-
sample_name E002.c1 E004.c1 E005.c1 E007.c1 so on...
T4456-C 1 0 1 1
T9589-C 1 1 0 0
Upvotes: 1
Views: 117
Reputation: 50668
A base R option using aggregate
and making use of the unary operator +
aggregate(. ~ sample_name, data = df, function(x) +(sum(x) > 0))
# sample_name E002.c1 E004.c1 E005.c1 E007.c1
#1 T4456-C 1 0 1 1
#2 T9589-C 1 1 0 0
This avoids any explicit ifelse
condition.
df <- read.table(text =
"sample_name E002.c1 E004.c1 E005.c1 E007.c1
T4456-C 0 0 0 0
T4456-C 1 0 0 1
T4456-C 1 0 1 1
T9589-C 0 1 0 0
T9589-C 1 1 0 0", header = T)
Below are the results of a microbenchmark
analysis of all methods presented in this post using a larger dataset consisting of N=10^6
rows:
# Generate sample with 10^6 rows
N <- 10^6
df <- data.frame(
sample_name = sample(letters[1:10], N, replace = T),
col1 = sample(c(1, 0), N, replace = T),
col2 = sample(c(1, 0), N, replace = T),
col3 = sample(c(1, 0), N, replace = T),
col4 = sample(c(1, 0), N, replace = T))
# Microbenchmark analysis
library(microbenchmark)
res <- microbenchmark(
tidyverse_ifelse_any = {
df %>% group_by(sample_name) %>% summarise_all(funs(ifelse(any(. == 1), 1, 0)))
},
tidyverse_max = {
df %>% group_by(sample_name) %>% summarise_all(funs(max))
},
tidyverse_any_int_div = {
df %>% group_by(sample_name) %>% summarise_all(funs(any(. %/% 1 == 1)*1))
},
tidyverse_mutate_if_ifelse = {
df %>% group_by(sample_name) %>% summarise_all(sum) %>% mutate_if(is.numeric, funs(if_else(. > 0, 1, 0)))
},
baseR_ifelse_any = {
aggregate(. ~ sample_name, data = df, function(x) ifelse(any(x == 1), 1, 0))
},
baseR_max = {
aggregate(. ~ sample_name, data = df, max)
},
baseR_any_int_div = {
aggregate(. ~ sample_name, data = df, function(x) any(x %/% 1 == 1)*1)
},
baseR_sum_unary_plus = {
aggregate(. ~ sample_name, data = df, function(x) +(sum(x) > 0))
},
datatable_ifelse_any = {
setDT(df)[, lapply(.SD, function(x) ifelse(any(x == 1), 1, 0)), by = sample_name]
},
datatable_any_int_div = {
setDT(df)[, lapply(.SD, function(x) any(x %/% 1 == 1)*1), by = sample_name]
}
)
res
#Unit: milliseconds
# expr min lq mean median
# tidyverse_ifelse_any 79.54145 87.49671 101.44983 96.69517
# tidyverse_max 60.85648 66.54888 75.71105 70.26009
# tidyverse_any_int_div 130.17937 139.99099 158.74449 152.59370
# tidyverse_mutate_if_ifelse 60.63313 66.42935 75.17535 70.19083
# baseR_ifelse_any 933.11576 1070.73916 1157.92271 1121.52533
# baseR_max 895.94086 1046.37304 1121.74497 1097.73445
# baseR_any_int_div 1003.90893 1115.72278 1179.91529 1138.17459
# baseR_sum_unary_plus 903.09797 1049.83542 1127.51391 1099.56222
# datatable_ifelse_any 93.47955 97.21338 111.67774 100.98314
# datatable_any_int_div 157.81882 164.51094 179.08096 173.94033
# uq max neval cld
# 109.08000 259.4346 100 ab
# 80.39179 142.4100 100 a
# 166.56710 349.8669 100 c
# 76.91358 253.4256 100 a
# 1187.60461 1775.9125 100 ef
# 1167.16448 1544.4371 100 d
# 1218.67363 1592.0093 100 f
# 1196.53435 1375.8022 100 de
# 115.57745 282.7197 100 b
# 187.37031 317.1613 100 c
library(ggplot2)
autoplot(res)
Provided I didn't make a mistake, I'm surprised that
tidyverse
/data.table
solutions (after all, tidyverse
code is usually not about efficient code but about clean code), anddata.table
solutions are not significantly faster than the tidyverse
/base R solutions.Upvotes: 3
Reputation: 39858
Also a tidyverse
possibility:
df %>%
group_by(sample_name) %>%
summarise_all(funs(ifelse(any(. == 1), 1, 0)))
sample_name E002.c1 E004.c1 E005.c1 E007.c1
<fct> <dbl> <dbl> <dbl> <dbl>
1 T4456-C 1. 0. 1. 1.
2 T9589-C 1. 1. 0. 0.
It groups by "sample_name" and then checks whether any value == 1. If so, it assigns 1, otherwise 0.
Or the same with data.table
:
setDT(df)[, lapply(.SD, function(x) ifelse(any(x == 1), 1, 0)), by = sample_name]
sample_name E002.c1 E004.c1 E005.c1 E007.c1
1: T4456-C 1 0 1 1
2: T9589-C 1 1 0 0
Or with just the base R:
aggregate(. ~ sample_name, data = df, function(x) ifelse(any(x == 1), 1, 0))
sample_name E002.c1 E004.c1 E005.c1 E007.c1
1 T4456-C 1 0 1 1
2 T9589-C 1 1 0 0
Or grouping by "sample_name" and then summarising the maximum value as proposed by @R Yoda:
df %>%
group_by(sample_name) %>%
summarise_all(funs(max))
The same with data.table
:
setDT(df)[, lapply(.SD, max), by = sample_name]
And with the base R:
aggregate(. ~ sample_name, data = df, max)
Or using numerical divison:
df %>%
group_by(sample_name) %>%
summarise_all(funs(any(. %/% 1 == 1)*1))
The same with data.table
:
setDT(df)[, lapply(.SD, function(x) any(x %/% 1 == 1)*1), by = sample_name]
And the base R:
aggregate(. ~ sample_name, data = df, function(x) any(x %/% 1 == 1)*1)
Upvotes: 4
Reputation: 2399
Try this:
library(tidyverse)
df %>%
group_by(sample_name) %>%
summarise_all(sum) %>%
mutate_if(is.numeric, funs(if_else(. > 0, 1, 0)))
Upvotes: 3