Reputation: 218
I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.
For example my data looks like:
Group Gene Score direct_count secondary_count
1 AQP11 0.5566507 4 5
1 CLNS1A 0.2811747 0 2
1 RSF1 0.5469924 3 6
2 CFDP1 0.4186066 1 2
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I am looking to add another column giving the average model score per group and a column for the average variation between scores per group.
So far for the average score per group, I am using
group_average_score <- aggregate( Score ~ Group, df, mean )
However, I am struggling to add this as an additional column to the data.
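One way to get the group mean onto every row, sketched in base R under the assumption that the data frame is called `df` as above: `ave()` returns a vector the same length as `Score` with each group's mean repeated, so it can be assigned directly as a column. Merging the `aggregate()` result back by `Group` is an alternative. The column names `Average_Score_merged` and `df2` are only illustrative.

```r
# Toy data copied from the example table in the question
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345)
)

# ave() repeats each group's mean on every row of that group
df$Average_Score <- ave(df$Score, df$Group, FUN = mean)

# Alternative: merge the aggregate() result back onto the data by Group
group_average_score <- aggregate(Score ~ Group, df, mean)
names(group_average_score)[2] <- "Average_Score_merged"  # illustrative name
df2 <- merge(df, group_average_score, by = "Group")
```

Both approaches should give the same per-group value on every row; `ave()` keeps the original row order, while `merge()` sorts by `Group`.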
Then for taking the average variation score per group I've been trying to go from a similar question (Calculate difference between values by group and matched for time) but I'm struggling to adjust this for my data. I've tried:
test <- df %>%
group_by(Group) %>%
mutate(Diff = c(NA, diff(Score)))
But I'm not sure this calculates the average variation across all genes' Score values per group. With my real data, the output gives several different variation averages per group when there should be just one.
Expected output should look something like:
Group Gene Score direct_count secondary_count Average_Score Average_Score_Difference
1 AQP11 0.5566507 4 5 0.46160593 0.183650
1 CLNS1A 0.2811747 0 2 0.46160593 0.183650
1 RSF1 0.5469924 3 6 0.46160593 0.183650
2 CFDP1 0.4186066 1 2 ... ...
2 CHST6 0.4295135 1 3
3 ACE 0.634 1 1
3 NOS2 0.6345 1 1
I think the Average_Score_Difference is fine, but note that I've calculated it by hand for the sake of example (the differences each gene has with every other gene, summed and divided by 3 for Group 1).
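The hand calculation described above can be sketched in base R, assuming it means the mean absolute difference over all gene pairs within a group (this reproduces the 0.18365 shown for Group 1). The helper name `mean_pairwise_diff` is made up for this example, and groups with fewer than two genes are assumed to get `NA`.

```r
# Toy data copied from the example table in the question
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345)
)

# Hypothetical helper: mean absolute difference over all pairs in x
mean_pairwise_diff <- function(x) {
  if (length(x) < 2) return(NA_real_)  # assumption: no pairs means NA
  mean(combn(x, 2, FUN = function(p) abs(p[1] - p[2])))
}

# ave() repeats the single per-group value on every row of the group
df$Average_Score_Difference <- ave(df$Score, df$Group, FUN = mean_pairwise_diff)
```

For a group of three genes there are three pairs, which matches the "divided by 3" in the hand calculation; for larger groups the divisor is the number of pairs, n(n-1)/2.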
Input data:
structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table",
"data.frame"))
Upvotes: 0
Views: 413
Reputation: 887213
Using data.table
library(data.table)
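# Per group: the mean score and the absolute differences between consecutive
# rows (first row set to 0), then the group mean of those differences
setDT(df)[, c('Avg', 'Diff') := .(mean(Score, na.rm = TRUE),
c(0, abs(diff(Score)))), Group][, AvgPerc := mean(Diff), Group]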
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 0
Reputation: 39605
Try this solution with dplyr, but more information about how to compute the last column should be provided:
library(dplyr)
#Code
newdf <- df %>% group_by(Group) %>% mutate(Avg = mean(Score, na.rm = TRUE),
Diff = c(0, abs(diff(Score))),
AvgPerc = mean(Diff, na.rm = TRUE))
Output:
# A tibble: 7 x 8
# Groups: Group [3]
Group Gene Score direct_count secondary_count Avg Diff AvgPerc
<int> <chr> <dbl> <int> <int> <dbl> <dbl> <dbl>
1 1 AQP11 0.557 4 5 0.462 0 0.180
2 1 CLNS1A 0.281 0 2 0.462 0.275 0.180
3 1 RSF1 0.547 3 6 0.462 0.266 0.180
4 2 CFDP1 0.419 1 2 0.424 0 0.00545
5 2 CHST6 0.430 1 3 0.424 0.0109 0.00545
6 3 ACE 0.634 1 1 0.634 0 0.000250
7 3 NOS2 0.634 1 1 0.634 0.000500 0.000250
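Note that AvgPerc for Group 1 (0.180) is not the 0.183650 from the question's expected output. A likely reason, assuming the hand calculation averaged the absolute differences over all gene pairs, is that `c(0, abs(diff(Score)))` compares only neighbouring rows and pads the first row with a zero. A quick base R check on the Group 1 scores:

```r
x <- c(0.5566507, 0.2811747, 0.5469924)  # Group 1 scores

# What AvgPerc computes: consecutive differences, padded with a leading 0
consecutive <- mean(c(0, abs(diff(x))))

# Mean absolute difference over all pairs (dist() on a vector gives |x_i - x_j|)
pairwise <- mean(as.numeric(dist(x)))

round(c(consecutive = consecutive, pairwise = pairwise), 4)
```

The two numbers agree only by coincidence of row order and group size, so which one is wanted depends on how the last column is defined.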
Some data used:
#Data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11",
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507,
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L,
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L,
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))
Upvotes: 2