Reputation: 218
I have a dataset where samples/genes are grouped by matching IDs. I am trying to compare the groups of matched IDs.
For example I have:
ID Gene Score
1:10 Gene1 0.8
1:10 Gene1 0.78
1:10 Gene4 0.6
2:20 Gene5 0.1
2:20 Gene6 0.7
3:30 Gene7 0.4
3:30 Gene8 0.6
3:30 Gene8 0.5
I am trying to compute summary statistics per ID group. For example: the percentage of ID groups that contain more than one gene with a score > 0.7 (in my example data, 33.3% of the groups), or the percentage of groups that contain both a gene scored 0.7 and a gene scored 0.1 under the same ID (also 33.3% of the groups in the example).
I have been trying to use duplicated() and filter() as a starting point, but beyond these I'm not sure which functions to try. Any advice would be appreciated.
Input data:
structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20",
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4",
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8,
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table",
"data.frame"))
Upvotes: 1
Views: 71
Reputation: 42544
For the sake of completeness, and because the question has a data.table tag:
# more than one row per ID with Score > 0.7 (any() would only test for at least one)
ds[, sum(Score > 0.7) > 1, by = .(ID)][, sum(V1)/length(V1)]
[1] 0.3333333
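# both 0.7 and 0.1 must occur under the same ID (an OR condition would accept either one alone)
ds[, any(Score == 0.7) & any(Score == 0.1), by = .(ID)][, sum(V1)/length(V1)]
or
ds[, all(c(0.1, 0.7) %in% Score), by = .(ID)][, sum(V1)/length(V1)]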
[1] 0.3333333
To verify that the results are correct, we can insert a print() call into the chained data.table expressions:
ds[, sum(Score > 0.7) > 1, by = .(ID)][, print(.SD)][, sum(V1)/length(V1)]
     ID    V1
1: 1:10  TRUE
2: 2:20 FALSE
3: 3:30 FALSE
[1] 0.3333333
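The phrase "more than one gene" in the question can be read as either more than one row or more than one distinct gene above the threshold. The following is a minimal sketch, not part of the original answer, that computes both readings plus the "0.1 and 0.7 both present" check in a single grouped step. The column names multi_rows, multi_genes and both_values are made up for illustration.

```r
library(data.table)

# Same example data as in the question
ds <- data.table(
  ID    = c("1:10", "1:10", "1:10", "2:20", "2:20", "3:30", "3:30", "3:30"),
  Gene  = c("Gene1", "Gene1", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene8"),
  Score = c(0.8, 0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)
)

# One row per ID with three logical flags (illustrative column names)
flags <- ds[, .(
  multi_rows  = sum(Score > 0.7) > 1,                    # more than one row above 0.7
  multi_genes = length(unique(Gene[Score > 0.7])) > 1,   # more than one distinct gene above 0.7
  both_values = all(c(0.1, 0.7) %in% Score)              # 0.1 and 0.7 both occur in the group
), by = ID]

# Fraction of ID groups for which each flag is TRUE
fractions <- flags[, lapply(.SD, mean),
                   .SDcols = c("multi_rows", "multi_genes", "both_values")]

print(flags)
print(fractions)
```

On the example data the two readings differ for ID 1:10, because both scores above 0.7 belong to Gene1. It is worth deciding which reading is intended before reporting a percentage.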
Data:
library(data.table)
ds <- structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20",
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4",
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8,
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table",
"data.frame"))
Upvotes: 4
Reputation: 1305
library("dplyr")
library("tidyr")  # replace_na() is provided by tidyr, not dplyr
df <- structure(list(ID = c("1:10", "1:10", "1:10", "2:20", "2:20",
"3:30", "3:30", "3:30"), Gene = c("Gene1", "Gene1", "Gene4",
"Gene5", "Gene6", "Gene7", "Gene8", "Gene8"), Score = c(0.8,
0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)), row.names = c(NA, -8L), class = c("data.table",
"data.frame"))
ID group has more than one gene with a score > 0.7
df %>%
group_by(ID) %>%
summarize(cond = sum(Score > 0.7) > 1) %>%
replace_na(list(cond = FALSE)) %>%
summarize(frac = sum(cond) / n())
# A tibble: 1 x 1
frac
<dbl>
1 0.333
ID group has at least one gene with a score of 0.7, and at least one with 0.1
df %>%
  group_by(ID) %>%
  # TRUE only if both 0.1 and 0.7 occur among the scores of this ID
  summarize(cond = all(c(0.1, 0.7) %in% Score)) %>%
  summarize(frac = sum(cond) / n())
# A tibble: 1 x 1
frac
<dbl>
1 0.333
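Exact comparison of doubles works here because the scores are typed in as literals, but it may fail if the scores come out of a calculation. The following is a hedged sketch, assuming a tolerance-based match is acceptable, that uses dplyr's near() instead of == or %in%. The tolerance value tol is an arbitrary choice for illustration.

```r
library(dplyr)

# Same example data as in the question
df <- data.frame(
  ID    = c("1:10", "1:10", "1:10", "2:20", "2:20", "3:30", "3:30", "3:30"),
  Gene  = c("Gene1", "Gene1", "Gene4", "Gene5", "Gene6", "Gene7", "Gene8", "Gene8"),
  Score = c(0.8, 0.78, 0.6, 0.1, 0.7, 0.4, 0.6, 0.5)
)

# Illustrative tolerance; near() defaults to sqrt(.Machine$double.eps)
tol <- 1e-8

result <- df %>%
  group_by(ID) %>%
  # TRUE only if the group has a score close to 0.7 and a score close to 0.1
  summarize(cond = any(near(Score, 0.7, tol = tol)) &&
                   any(near(Score, 0.1, tol = tol))) %>%
  summarize(frac = mean(cond))

print(result)
```

Here mean(cond) is equivalent to sum(cond) / n(), since the mean of a logical vector is the share of TRUE values.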
Upvotes: 2