Reputation: 63

How to select individual rows from duplicates based on the highest median in R?

I have a dataframe containing gene expression data that looks like the following:

row.names     symbol     Sample1     Sample2     Sample3     Sample4
Probe1        Gene1      1.5         2.8         1.8         3.2
Probe2        Gene2      2.7         4.5         3.2         5.1
Probe3        Gene3      1.1         4.7         2.3         5.3
Probe4        Gene2      1.2         0.9         0.8         1.1
Probe5        Gene1      3.1         6.1         6.2         4.2

I want to subset the data so that each gene appears only once, keeping for each gene the probe with the highest median across samples. For example, the data above would become the following:

row.names     symbol     Sample1     Sample2     Sample3     Sample4
Probe2        Gene2      2.7         4.5         3.2         5.1
Probe3        Gene3      1.1         4.7         2.3         5.3
Probe5        Gene1      3.1         6.1         6.2         4.2

The dataframe has ~40,000 individual probes and ~100 samples.

Does anyone have any idea which commands in R are suitable for the task?

Upvotes: 2

Answers (2)

David Arenburg

Reputation: 92292

I wouldn't calculate the medians row by row with apply; I would use the vectorized rowMedians function from the matrixStats package instead. Then I would order the rows by decreasing median and keep the first entry per symbol using unique from the data.table package:

library(data.table)
library(matrixStats)
df$Medians <- rowMedians(as.matrix(df[-(1:2)]))
unique(setDT(df)[order(-Medians)], by = "symbol")
#    row.names symbol Sample1 Sample2 Sample3 Sample4 Medians
# 1:    Probe5  Gene1     3.1     6.1     6.2     4.2    5.15
# 2:    Probe2  Gene2     2.7     4.5     3.2     5.1    3.85
# 3:    Probe3  Gene3     1.1     4.7     2.3     5.3    3.50
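
If you want to reproduce this on the example from the question without installing anything, here is a minimal base R sketch of the same order-then-deduplicate idea. It is not the code benchmarked below. It assumes the probe IDs live in a regular column (called `probe` here, a made-up name) and that the sample columns all start with "Sample". It uses `apply` rather than `rowMedians`, so expect it to be slower on large data.

```r
# Example data from the question; the column name `probe` is an assumption
df <- data.frame(
  probe   = paste0("Probe", 1:5),
  symbol  = c("Gene1", "Gene2", "Gene3", "Gene2", "Gene1"),
  Sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  Sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  Sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  Sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Row-wise medians over the sample columns only
sample_cols <- grep("^Sample", names(df))
med <- apply(as.matrix(df[sample_cols]), 1, median)

# Sort by decreasing median, then keep the first row seen for each symbol
ord <- order(-med)
result <- df[ord, ][!duplicated(df$symbol[ord]), ]
result$probe
```

Because the rows are sorted before `duplicated` is applied, the first occurrence of each symbol is the probe with the highest median, which is the same logic as `unique(..., by = "symbol")` above.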

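Some benchmarks on a larger simulated data set (100,000 probes, 100 gene symbols, 100 sample columns), comparing this approach with a grouped data.table version and the dplyr approach: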

library(data.table)
library(matrixStats)
library(dplyr)

set.seed(123)
bigdf <- data.frame(A = paste0("Probe", 1:1e5),
                    symbol = paste0("Gene", sample(1e2, 1e5, replace = TRUE)),
                    matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
bigdf2 <- copy(bigdf)
bigdf3 <- copy(bigdf2)

system.time({
  bigdf$Medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
  unique(setDT(bigdf)[order(-Medians)], by = "symbol")
  })

# user  system elapsed 
# 0.22    0.05    0.26 

system.time(setDT(bigdf2)[,.SD[which.max(apply(.SD[,-(1:2), with = FALSE], 1, median)),], by = symbol])
# user  system elapsed 
# 5.17    0.01    5.33 
system.time({
  bigdf3$medianCol <- apply(bigdf3[-(1:2)], 1, FUN = median)
  grouped_df <- group_by(bigdf3, symbol)
  filtered_df <- filter(grouped_df, medianCol == max(medianCol))
})
# user  system elapsed 
# 5.15    0.00    5.15

Upvotes: 3

Wannes Rosiers

Reputation: 1690

Or using dplyr:

library(dplyr)
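# Columns 2:5 are the four sample columns only if the probe IDs are stored as row names;
# if the probe IDs are a regular first column, use df[-(1:2)] instead
df$medianCol <- apply(df[,2:5],1,FUN = median)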
grouped_df <- group_by(df,symbol)
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
filtered_df$medianCol <- NULL
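
One thing to be aware of: `filter(medianCol == max(medianCol))` keeps every row that ties for the highest median within a symbol, so a gene can still appear more than once. A possible way around this, sketched here on made-up toy data (the `probe` column name and the values are assumptions for illustration), is `slice_max()` with `with_ties = FALSE`, which needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data with a deliberate tie: both Gene1 probes have a median of 2
df <- data.frame(
  probe   = c("Probe1", "Probe2", "Probe3"),
  symbol  = c("Gene1", "Gene1", "Gene2"),
  Sample1 = c(1, 2, 5),
  Sample2 = c(3, 2, 7),
  stringsAsFactors = FALSE
)
df$medianCol <- apply(df[-(1:2)], 1, median)

# filter(== max) keeps all tied rows, so Gene1 appears twice
tied <- df %>%
  group_by(symbol) %>%
  filter(medianCol == max(medianCol)) %>%
  ungroup()

# slice_max(with_ties = FALSE) keeps exactly one row per symbol
one_each <- df %>%
  group_by(symbol) %>%
  slice_max(medianCol, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(tied)
nrow(one_each)
```

Whether ties matter depends on the data; with continuous expression values they are probably rare, but duplicated probes with identical values would trigger them.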

Upvotes: 1
