DN1

Reputation: 218

How to compare grouped duplicate rows in R

I have a dataset where samples/genes are grouped by matching IDs. I am trying to compare the groups of matched IDs.

For example, I have:

ID      Gene    Score
1:10    Gene1    0.8
1:10    Gene1    0.78
1:10    Gene4    0.6
2:20    Gene5    0.1
2:20    Gene6    0.7
3:30    Gene7    0.4
3:30    Gene8    0.6  
3:30    Gene8    0.5

I am trying to compute statistics over these groups. One is the percentage of matching-ID groups that contain more than one gene with a score > 0.7 (33.3% of groups in my example data). Another is the percentage of groups that contain both a gene scoring 0.7 and a gene scoring 0.1 under the same matched ID (also 33.3% of groups in the example).

I have started with duplicated() and filter(), but beyond these I'm not sure which functions to try. Any advice would be appreciated.

Input data:

structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20", 
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4", 
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8, 
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

Upvotes: 1

Views: 71

Answers (2)

Uwe

Reputation: 42544

For the sake of completeness and because the question has a data.table tag:

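# share of ID groups with more than one row scoring above 0.7
ds[, sum(Score > 0.7) > 1, by = .(ID)][, sum(V1)/length(V1)]
[1] 0.3333333
# share of ID groups containing both a 0.7 score and a 0.1 score
ds[, any(Score == 0.7) & any(Score == 0.1), by = .(ID)][, sum(V1)/length(V1)]

or

ds[, all(c(0.1, 0.7) %in% Score), by = .(ID)][, sum(V1)/length(V1)]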
[1] 0.3333333

To verify the intermediate per-group results, we can insert a print() call into the chained data.table expression:

ds[, sum(Score > 0.7) > 1, by = .(ID)][, print(.SD)][, sum(V1)/length(V1)]
     ID    V1
1: 1:10  TRUE
2: 2:20 FALSE
3: 3:30 FALSE
[1] 0.3333333
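
As a rough sketch (not part of the original answer), both statistics can also be computed in a single grouped pass and kept as named columns, so the per-group flags stay available for inspection. It reads the question as "more than one row scoring above 0.7" and "both a 0.7 and a 0.1 score present in the same ID group". The names `flags`, `fractions`, `multi_high` and `has_both` are illustrative assumptions, and the data are rebuilt here only so the block runs on its own.

```r
library(data.table)

# Example data from the question, rebuilt so the sketch is self-contained
ds <- data.table(
  ID    = c("1:10", "1:10", "1:10", "2:20", "2:20", "3:30", "3:30", "3:30"),
  Gene  = c("Gene1", "Gene1", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene8"),
  Score = c(0.8, 0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)
)

# One logical flag per ID group for each condition (hypothetical column names)
flags <- ds[, .(
  multi_high = sum(Score > 0.7) > 1,                    # more than one row above 0.7
  has_both   = any(Score == 0.7) && any(Score == 0.1)   # both exact scores present
), by = ID]

# The mean of a logical vector is the fraction of TRUE values
fractions <- flags[, .(
  frac_multi_high = mean(multi_high),
  frac_has_both   = mean(has_both)
)]

print(flags)
print(fractions)
```

On the example data, one of the three groups satisfies each condition, so both fractions come out at one third.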

Data

library(data.table)
ds <- structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20", 
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4", 
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8, 
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

Upvotes: 4

rpolicastro

Reputation: 1305

library("dplyr")
library("tidyr")  # provides replace_na()

df <- structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20", 
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4", 
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8, 
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

ID group has more than one gene with a score > 0.7

df %>%
  group_by(ID) %>%
  summarize(cond = sum(Score > 0.7) > 1) %>%
  replace_na(list(cond = FALSE)) %>%
  summarize(frac = sum(cond) / n())

# A tibble: 1 x 1
   frac
  <dbl>
1 0.333

ID group has at least one gene with a score of 0.7, and at least one with 0.1

df %>%
  group_by(ID) %>%
  # TRUE only if the group contains both a 0.7 score and a 0.1 score
  summarize(cond = any(Score == 0.7) & any(Score == 0.1)) %>%
  replace_na(list(cond = FALSE)) %>%
  summarize(frac = sum(cond) / n())

# A tibble: 1 x 1
   frac
  <dbl>
1 0.333
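
One ambiguity worth flagging: "more than one gene" could mean more than one row or more than one distinct gene, and the two readings differ on the example data because both high scores in group 1:10 belong to Gene1. The following is only a sketch under that assumption, not the answerer's method. It counts both ways, and it uses dplyr's `near()` instead of `==` in case the scores are computed values that are not exactly 0.7 or 0.1. The names `per_group`, `result`, `rows_high` and `genes_high` are hypothetical.

```r
library(dplyr)

# Example data from the question, rebuilt so the sketch is self-contained
df <- data.frame(
  ID    = c("1:10", "1:10", "1:10", "2:20", "2:20", "3:30", "3:30", "3:30"),
  Gene  = c("Gene1", "Gene1", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene8"),
  Score = c(0.8, 0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)
)

per_group <- df %>%
  group_by(ID) %>%
  summarize(
    rows_high  = sum(Score > 0.7),               # rows scoring above 0.7
    genes_high = n_distinct(Gene[Score > 0.7]),  # distinct genes scoring above 0.7
    # near() compares with a small tolerance rather than exact equality
    has_both   = any(near(Score, 0.7)) & any(near(Score, 0.1))
  )

result <- per_group %>%
  summarize(
    frac_rows_high  = mean(rows_high > 1),
    frac_genes_high = mean(genes_high > 1),
    frac_has_both   = mean(has_both)
  )

print(per_group)
print(result)
```

With the example data, counting rows gives one group in three, while counting distinct genes gives zero, so it is worth deciding which reading is intended before reporting the percentage.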

Upvotes: 2
