jimuq233
jimuq233

Reputation: 53

Taking means of duplicated IDs

I have a dataset "rna", which is thousands of rows that I have simplified below. There are rows that contain both the same "Sample" and "Ensembl", but have different expression values. I need a way to take the mean of the expression values for rows that have both the same sample ID and ensembl. Rows that contain unique combinations of "Sample" and "Ensembl" should not have expression values changed. I've tried to write a for loop with selecting for duplicated rows, but I can't figure out how to pair them in order to take the Expression means. "org-rna" matrix shows what I want the result to be. TYIA

rna = data.frame("Sample" = c(1, 1, 1, 2, 2, 3), "Ensembl" =  c("ENSG00000174010", "ENSG00000174010", "ENSG00000192738", "ENSG00000147183", "ENSG00000147183",
    "ENSG00000231267"), "Expression" = c(1, 2, 3.5, 1, 6, 8))


org_rna = data.frame("Sample" = c(1, 1, 2,3), "Ensembl" = c("ENSG00000174010", "ENSG00000192738", "ENSG00000147183", 
           "ENSG00000231267"), "Expression" =c(1.5, 3.5, 3.5, 8)) 

Upvotes: 1

Views: 58

Answers (2)

akrun
akrun

Reputation: 887241

Or could use base R

aggregate(Expression ~ ., type.convert(rna, as.is = TRUE), mean)

Upvotes: 2

Martin Gal
Martin Gal

Reputation: 16988

If you are open to a tidyverse solution you could use

library(dplyr)

rna %>% 
  as_tibble() %>% 
  group_by(Sample, Ensembl) %>% 
  summarise(Expression = mean(as.numeric(Expression)), .groups = "drop")

which returns

# A tibble: 4 x 3
  Sample Ensembl         Expression
  <chr>  <chr>                <dbl>
1 1      ENSG00000174010        1.5
2 1      ENSG00000192738        3.5
3 2      ENSG00000147183        3.5
4 3      ENSG00000231267        8  

Note: I converted your matrix into a tibble/data.frame which is easier to work on.

You need to group by Sample and Ensemble and calculate the mean.

Upvotes: 3

Related Questions