ritz
ritz

Reputation: 107

Reshape the data frame and consider 1 if 0 and 1 is present

I have a data frame that contain 500 rows and 20000 columns. rows contain different sample IDs and It have duplicate sample ID in rows but different column values. My data frame is like this .....

sample_name   E002.c1   E004.c1  E005.c1  E007.c1  so on...
T4456-C        0           0        0        0
T4456-C        1           0        0        1
T4456-C        1           0        1        1
T9589-C        0           1        0        0
T9589-C        1           1        0        0

Is there is a way to merge these ID like

If column contain 0 for all then consider value as 0 If column contain at least one 1 then consider value as 1.

Expected output :-

sample_name   E002.c1   E004.c1  E005.c1  E007.c1  so on...
T4456-C        1           0        1        1
T9589-C        1           1        0        0

Upvotes: 1

Views: 117

Answers (3)

Maurits Evers
Maurits Evers

Reputation: 50668

A base R option using aggregate and making use of the unary operator +

aggregate(. ~ sample_name, data = df, function(x) +(sum(x) > 0))
#  sample_name E002.c1 E004.c1 E005.c1 E007.c1
#1     T4456-C       1       0       1       1
#2     T9589-C       1       1       0       0

This avoids any explicit ifelse condition.


Sample data

df <- read.table(text =
    "sample_name   E002.c1   E004.c1  E005.c1  E007.c1
T4456-C        0           0        0        0
T4456-C        1           0        0        1
T4456-C        1           0        1        1
T9589-C        0           1        0        0
T9589-C        1           1        0        0", header = T)

Microbenchmark analysis

Below are the results of a microbenchmark analysis of all methods presented in this post using a larger dataset consisting of N=10^6 rows:

# Generate sample with 10^6 rows
N <- 10^6
df <- data.frame(
    sample_name = sample(letters[1:10], N, replace = T),
    col1 = sample(c(1, 0), N, replace = T),
    col2 = sample(c(1, 0), N, replace = T),
    col3 = sample(c(1, 0), N, replace = T),
    col4 = sample(c(1, 0), N, replace = T))

# Microbenchmark analysis
library(microbenchmark)
res <- microbenchmark(
    tidyverse_ifelse_any = {
        df %>% group_by(sample_name) %>% summarise_all(funs(ifelse(any(. == 1), 1, 0)))
    },
    tidyverse_max = {
        df %>% group_by(sample_name) %>% summarise_all(funs(max))
    },
    tidyverse_any_int_div = {
        df %>% group_by(sample_name) %>% summarise_all(funs(any(. %/% 1 == 1)*1))
    },
    tidyverse_mutate_if_ifelse = {
        df %>% group_by(sample_name) %>% summarise_all(sum) %>% mutate_if(is.numeric, funs(if_else(. > 0, 1, 0)))
    },
    baseR_ifelse_any = {
        aggregate(. ~ sample_name, data = df, function(x) ifelse(any(x == 1), 1, 0))
    },
    baseR_max = {
        aggregate(. ~ sample_name, data = df, max)
    },
    baseR_any_int_div = {
        aggregate(. ~ sample_name, data = df, function(x) any(x %/% 1 == 1)*1)
    },
    baseR_sum_unary_plus = {
        aggregate(. ~ sample_name, data = df, function(x) +(sum(x) > 0))
    },
    datatable_ifelse_any = {
        setDT(df)[, lapply(.SD, function(x) ifelse(any(x == 1), 1, 0)), by = sample_name]
    },
    datatable_any_int_div = {
        setDT(df)[, lapply(.SD, function(x) any(x %/% 1 == 1)*1), by = sample_name]
    }
)

res
#Unit: milliseconds
#                       expr        min         lq       mean     median
#       tidyverse_ifelse_any   79.54145   87.49671  101.44983   96.69517
#              tidyverse_max   60.85648   66.54888   75.71105   70.26009
#      tidyverse_any_int_div  130.17937  139.99099  158.74449  152.59370
# tidyverse_mutate_if_ifelse   60.63313   66.42935   75.17535   70.19083
#           baseR_ifelse_any  933.11576 1070.73916 1157.92271 1121.52533
#                  baseR_max  895.94086 1046.37304 1121.74497 1097.73445
#          baseR_any_int_div 1003.90893 1115.72278 1179.91529 1138.17459
#       baseR_sum_unary_plus  903.09797 1049.83542 1127.51391 1099.56222
#       datatable_ifelse_any   93.47955   97.21338  111.67774  100.98314
#      datatable_any_int_div  157.81882  164.51094  179.08096  173.94033
#         uq       max neval    cld
#  109.08000  259.4346   100 ab
#   80.39179  142.4100   100 a
#  166.56710  349.8669   100   c
#   76.91358  253.4256   100 a
# 1187.60461 1775.9125   100     ef
# 1167.16448 1544.4371   100    d
# 1218.67363 1592.0093   100      f
# 1196.53435 1375.8022   100    de
#  115.57745  282.7197   100  b
#  187.37031  317.1613   100   c

library(ggplot2)
autoplot(res)

Provided I didn't make a mistake, I'm surprised that

  1. the base R solutions are significantly slower than the tidyverse/data.table solutions (after all, tidyverse code is usually not about efficient code but about clean code), and
  2. the data.table solutions are not significantly faster than the tidyverse/base R solutions.

enter image description here

Upvotes: 3

tmfmnk
tmfmnk

Reputation: 39858

Also a tidyverse possibility:

df %>%
 group_by(sample_name) %>%
 summarise_all(funs(ifelse(any(. == 1), 1, 0)))

  sample_name E002.c1 E004.c1 E005.c1 E007.c1
  <fct>         <dbl>   <dbl>   <dbl>   <dbl>
1 T4456-C          1.      0.      1.      1.
2 T9589-C          1.      1.      0.      0.

It groups by "sample_name" and then checks whether any value == 1. If so, it assigns 1, otherwise 0.

Or the same with data.table:

setDT(df)[, lapply(.SD, function(x) ifelse(any(x == 1), 1, 0)), by = sample_name]

   sample_name E002.c1 E004.c1 E005.c1 E007.c1
1:     T4456-C       1       0       1       1
2:     T9589-C       1       1       0       0

Or with just the base R:

aggregate(. ~ sample_name, data = df, function(x) ifelse(any(x == 1), 1, 0))

  sample_name E002.c1 E004.c1 E005.c1 E007.c1
1     T4456-C       1       0       1       1
2     T9589-C       1       1       0       0

Or grouping by "sample_name" and then summarising the maximum value as proposed by @R Yoda:

df %>%
 group_by(sample_name) %>%
 summarise_all(funs(max))

The same with data.table:

setDT(df)[, lapply(.SD, max), by = sample_name]

And with the base R:

aggregate(. ~ sample_name, data = df, max)

Or using numerical divison:

df %>%
 group_by(sample_name) %>%
 summarise_all(funs(any(. %/% 1 == 1)*1))

The same with data.table:

setDT(df)[, lapply(.SD, function(x) any(x %/% 1 == 1)*1), by = sample_name]

And the base R:

aggregate(. ~ sample_name, data = df, function(x) any(x %/% 1 == 1)*1)

Upvotes: 4

Paweł Chabros
Paweł Chabros

Reputation: 2399

Try this:

library(tidyverse)

df %>%
  group_by(sample_name) %>%
  summarise_all(sum) %>%
  mutate_if(is.numeric, funs(if_else(. > 0, 1, 0)))

Upvotes: 3

Related Questions