Reputation: 83
I have a dataframe of expression data where gene are rows and columns are samples. I also have a dataframe containing metadata for each sample in the expression dataframe. In reality my expr dataframe has 30,000+ rows and 100+ columns. However, below is an example with smaller data.
expr <- data.frame(sample1 = c(1,2,2,0,0),
sample2 = c(5,2,4,4,0),
sample3 = c(1,2,1,0,1),
sample4 = c(6,5,6,6,7),
sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)
meta <- data.frame(sample = paste0("sample",1:5),
treatment = c("control","control",
"treatment1",
"treatment2", "treatment2"))
I want to find the mean for each gene per treatment. From the examples I've seen with split() or group_by() people group based on a column already present in the data.frame. However, I have a separate dataframe (meta) that classifies the grouping for the columns in another dataframe (expr).
I would like my output to be a dataframe with genes as rows, treatment as columns, and values as the mean.
# control treatment1 treatment2
# gene1 mean mean mean
# gene2 mean mean mean
Upvotes: 7
Views: 211
Reputation: 28825
Here's another approach in tidyverse
, replacing sample numbers with treatments from meta
by creating a named vector instead of left_join()
, and also using values_fn
within pivot_wider()
instead of summarise()
:
library(dplyr)
library(tidyr)
expr %>%
tibble::rownames_to_column("gene") %>%
pivot_longer(-gene) %>%
mutate(name = split(meta$treatment, meta$sample)[name]) %>%
pivot_wider(values_fn = mean)
#> # A tibble: 5 × 4
#> gene control treatment1 treatment2
#> <chr> <dbl> <dbl> <dbl>
#> 1 gene1 3 1 3
#> 2 gene2 2 2 2.5
#> 3 gene3 3 1 3
#> 4 gene4 2 0 3.5
#> 5 gene5 0 1 4
Created on 2023-12-21 with reprex v2.0.2
Upvotes: 2
Reputation: 79228
A base R approach:
expr|>
split.default(with(meta, treatment[match(names(expr), sample)]))|>
lapply(rowMeans)|>
structure(dim=3)|>
array2DF()
Var1 gene1 gene2 gene3 gene4 gene5
1 control 3 2.0 3 2.0 0
2 treatment1 1 2.0 1 0.0 1
3 treatment2 3 2.5 3 3.5 4
Upvotes: 4
Reputation: 78927
Here is a data.table
approach with the same logic provided by @Gregor Thomas:
library(data.table)
expr_dt <- setDT(expr)
expr_dt[, gene := rownames(expr)]
meta_dt <- setDT(meta)
melt(expr_dt, id.vars = "gene", variable.name = "sample", value.name = "expression")[
meta_dt, on = .(sample)][
, .(mean = mean(expression)), by = .(gene, treatment)][
, dcast(.SD, gene ~ treatment, value.var = "mean")]
gene control treatment1 treatment2
1: 1 3 1 3.0
2: 2 2 2 2.5
3: 3 3 1 3.0
4: 4 2 0 3.5
5: 5 0 1 4.0
Upvotes: 2
Reputation: 7410
An approach in base R which works for the particular toy data example given:
colnames(expr) = paste0(colnames(expr), "_",
meta$treatment[match(colnames(expr), meta$sample)])
vapply(unique(meta$treatment),
\(i) rowMeans(expr[grepl(i, colnames(expr))]), numeric(nrow(expr)))
#> control treatment1 treatment2
#> gene1 3 1 3.0
#> gene2 2 2 2.5
#> gene3 3 1 3.0
#> gene4 2 0 3.5
#> gene5 0 1 4.0
Data
expr <- data.frame(sample1 = c(1,2,2,0,0),
sample2 = c(5,2,4,4,0),
sample3 = c(1,2,1,0,1),
sample4 = c(6,5,6,6,7),
sample5 = c(0,0,0,1,1))
rownames(expr) <- paste0("gene",1:5)
meta <- data.frame(sample = paste0("sample",1:5),
treatment = c("control","control",
"treatment1",
"treatment2", "treatment2"))
Upvotes: 6
Reputation: 145775
Something like this. It's not entirely clear what you want to group by in the last step, but you can adjust that easily.
library(dplyr)
library(tidyr)
expr |>
mutate(gene = row.names(expr)) |>
pivot_longer(-gene, names_to = "sample") |>
left_join(meta, by = "sample") |>
summarize(mean = mean(value), .by = c(gene, treatment)) |>
pivot_wider(names_from = treatment, values_from = mean)
# # A tibble: 5 × 4
# gene control treatment1 treatment2
# <chr> <dbl> <dbl> <dbl>
# 1 gene1 3 1 3
# 2 gene2 2 2 2.5
# 3 gene3 3 1 3
# 4 gene4 2 0 3.5
# 5 gene5 0 1 4
Upvotes: 2