Reputation: 63
I have a dataframe containing gene expression data that looks like the following:
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 1.5 2.8 1.8 3.2
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe4 Gene2 1.2 0.9 0.8 1.1
Probe5 Gene1 3.1 6.1 6.2 4.2
I want to subset the data so that each gene appears only once, keeping for each gene the probe with the highest median across samples, i.e. the data above would become the following:
row.names symbol Sample1 Sample2 Sample3 Sample4
Probe2 Gene2 2.7 4.5 3.2 5.1
Probe3 Gene3 1.1 4.7 2.3 5.3
Probe5 Gene1 3.1 6.1 6.2 4.2
The dataframe has ~40,000 individual probes and ~100 samples.
Does anyone have any idea which commands in R are suitable for the task?
Upvotes: 2
Views: 84
Reputation: 92292
I wouldn't calculate the medians row by row; I'd rather use the vectorized rowMedians function from the matrixStats package for that. Then I would reorder by the result and select the unique entries using the data.table package:
library(data.table)
library(matrixStats)
df$Medians <- rowMedians(as.matrix(df[-(1:2)]))
unique(setDT(df)[order(-Medians)], by = "symbol")
# row.names symbol Sample1 Sample2 Sample3 Sample4 Medians
# 1: Probe5 Gene1 3.1 6.1 6.2 4.2 5.15
# 2: Probe2 Gene2 2.7 4.5 3.2 5.1 3.85
# 3: Probe3 Gene3 1.1 4.7 2.3 5.3 3.50
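The same order-then-deduplicate idea can be sketched in base R without any packages. This is only an illustration on the toy data from the question: the column name probe is an assumption (the question shows the probe IDs under row.names), and apply is swapped in for rowMedians purely to avoid dependencies, so it will be slower on the real 40,000-row data.

```r
# Toy data from the question; the column name `probe` is an assumption
df <- data.frame(
  probe   = paste0("Probe", 1:5),
  symbol  = c("Gene1", "Gene2", "Gene3", "Gene2", "Gene1"),
  Sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  Sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  Sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  Sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Select the sample columns by name rather than by position
sample_cols <- grep("^Sample", names(df))

# Row medians (base R stand-in for matrixStats::rowMedians)
med <- apply(as.matrix(df[sample_cols]), 1, median)

# Sort by descending median, then keep the first row seen for each symbol
sorted <- df[order(-med), ]
result <- sorted[!duplicated(sorted$symbol), ]
result
```

Selecting the sample columns by name avoids depending on whether the probe IDs are stored as a column or as row names.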
Some benchmarks
library(data.table)
library(matrixStats)
library(dplyr)
set.seed(123)
bigdf <- data.frame(A = paste0("Probe", 1:1e5),
symbol = paste0("Gene", sample(1e2, 1e5, replace = TRUE)),
matrix(sample(1e2, 1e6, replace = TRUE), ncol = 100))
bigdf2 <- copy(bigdf)
bigdf3 <- copy(bigdf2)
system.time({
bigdf$Medians <- rowMedians(as.matrix(bigdf[-(1:2)]))
unique(setDT(bigdf)[order(-Medians)], by = "symbol")
})
# user system elapsed
# 0.22 0.05 0.26
# .SD already excludes the grouping column (symbol), so only the probe ID column A is dropped here
system.time(setDT(bigdf2)[, .SD[which.max(apply(.SD[, -1, with = FALSE], 1, median))], by = symbol])
# user system elapsed
# 5.17 0.01 5.33
system.time({
bigdf3$medianCol <- apply(bigdf3[-(1:2)],1,FUN = median)
grouped_df <- group_by(bigdf3,symbol)
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
})
# user system elapsed
# 5.15 0.00 5.15
Upvotes: 3
Reputation: 1690
Or using dplyr:
library(dplyr)
# Assumes the probe IDs are stored as row names, so that columns 2:5 are Sample1..Sample4
df$medianCol <- apply(df[, 2:5], 1, FUN = median)
grouped_df <- group_by(df,symbol)
filtered_df <- filter(grouped_df, medianCol == max(medianCol))
filtered_df$medianCol <- NULL
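Note that filter(medianCol == max(medianCol)) keeps every probe that ties for the maximum, so a gene could still appear more than once. A possible variant, sketched here on the question's toy data, uses slice_max with with_ties = FALSE (available in dplyr 1.0.0 and later) to keep exactly one row per gene. The column name probe is an assumption, not something from the original data.

```r
library(dplyr)

# Toy data from the question; the column name `probe` is an assumption
df <- data.frame(
  probe   = paste0("Probe", 1:5),
  symbol  = c("Gene1", "Gene2", "Gene3", "Gene2", "Gene1"),
  Sample1 = c(1.5, 2.7, 1.1, 1.2, 3.1),
  Sample2 = c(2.8, 4.5, 4.7, 0.9, 6.1),
  Sample3 = c(1.8, 3.2, 2.3, 0.8, 6.2),
  Sample4 = c(3.2, 5.1, 5.3, 1.1, 4.2),
  stringsAsFactors = FALSE
)

# Median per probe over the sample columns, selected by name
sample_cols <- grep("^Sample", names(df), value = TRUE)
df$medianCol <- apply(as.matrix(df[sample_cols]), 1, median)

# One row per gene: the probe with the highest median, ties broken arbitrarily
filtered_df <- df %>%
  group_by(symbol) %>%
  slice_max(medianCol, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-medianCol)

filtered_df
```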
Upvotes: 1