DN1

Reputation: 218

How to calculate average variation per group in R?

I have a dataset of groups of genes with each gene having a different score. I am looking to calculate the average gene score and average variation/difference of scores between genes per group.

For example my data looks like:

Group   Gene      Score     direct_count   secondary_count 
    1   AQP11    0.5566507       4               5
    1   CLNS1A   0.2811747       0               2
    1   RSF1     0.5469924       3               6
    2   CFDP1    0.4186066       1               2
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1

I am looking to add another column giving the average model score per group and a column for the average variation between scores per group.

So far, for the average score per group, I am using:

group_average_score <- aggregate( Score ~ Group, df, mean )

However, I am struggling to add this as an additional column in the data.
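One possible way to attach the group mean to every row is base R's `ave()`, which returns a vector the same length as its input. This is only a sketch: `ave()` is not what the question uses, and the small `df` below is a stand-in holding just the Group and Score columns from the example.

```r
# Stand-in for the example data (only the columns needed here)
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345)
)

# ave() applies FUN within each Group and repeats the result on every row of that group
df$Average_Score <- ave(df$Score, df$Group, FUN = mean)
```

Merging the `aggregate()` result back with `merge()` would also work, but both tables would carry a column named Score, so one of them would need renaming first.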

Then, for the average variation per group, I've been trying to adapt a similar question (Calculate difference between values by group and matched for time), but I'm struggling to adjust it for my data. I've tried:

test <- df %>%
  group_by(Group) %>%
  mutate(Diff = c(NA, diff(Score)))

But I'm not sure this calculates the average variation across all genes' scores in each group. With my real data, the output gives several different values per group when there should be just one.

Expected output should look something like:

Group   Gene      Score     direct_count   secondary_count    Average_Score    Average_Score_Difference
    1   AQP11    0.5566507       4               5             0.46160593          0.183650
    1   CLNS1A   0.2811747       0               2             0.46160593          0.183650
    1   RSF1     0.5469924       3               6             0.46160593          0.183650
    2   CFDP1    0.4186066       1               2                ...                 ...
    2   CHST6    0.4295135       1               3
    3   ACE      0.634           1               1
    3   NOS2     0.6345          1               1

I think the Average_Score_Difference is right, but note that I calculated it by hand for the sake of the example: for Group 1, I took the difference between each pair of genes, summed the three differences, and divided by 3.
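The by-hand figure for Group 1 can be reproduced with a short sketch. It assumes that "average variation" means the mean absolute difference over all gene pairs, which is how the description above reads; `combn()` is used here only to enumerate the pairs.

```r
# Group 1 scores from the example table (AQP11, CLNS1A, RSF1)
scores <- c(0.5566507, 0.2811747, 0.5469924)

# Absolute difference for every unordered pair of genes (3 pairs for 3 genes)
pair_diffs <- combn(scores, 2, FUN = function(p) abs(p[1] - p[2]))

mean(scores)      # approximately 0.46161, the Average_Score
mean(pair_diffs)  # approximately 0.18365, the Average_Score_Difference
```

For a group with only two genes, there is a single pair, so the average is just the absolute difference between the two scores.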

Input data:

df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507, 
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L)), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))

Upvotes: 0

Views: 413

Answers (2)

akrun

Reputation: 887213

Using data.table

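library(data.table)
# Avg:  mean Score per Group
# Diff: absolute difference from the previous row within the Group (0 for the first row)
# AvgPerc: mean of Diff per Group, added in the second chained step
setDT(df)[, c('Avg', 'Diff') := .(mean(Score, na.rm = TRUE),
           c(0, abs(diff(Score)))), Group][, AvgPerc := mean(Diff), Group]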

Data:

df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507, 
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 0

Duck

Reputation: 39605

Try this solution with dplyr, although more information about how the last column should be computed would help:

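library(dplyr)
# Per group: mean score, absolute difference from the previous row
# (0 for the first row), and the mean of those differences
newdf <- df %>%
  group_by(Group) %>%
  mutate(Avg = mean(Score, na.rm = TRUE),
         Diff = c(0, abs(diff(Score))),
         AvgPerc = mean(Diff, na.rm = TRUE))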

Output:

# A tibble: 7 x 8
# Groups:   Group [3]
  Group Gene   Score direct_count secondary_count   Avg     Diff  AvgPerc
  <int> <chr>  <dbl>        <int>           <int> <dbl>    <dbl>    <dbl>
1     1 AQP11  0.557            4               5 0.462 0        0.180   
2     1 CLNS1A 0.281            0               2 0.462 0.275    0.180   
3     1 RSF1   0.547            3               6 0.462 0.266    0.180   
4     2 CFDP1  0.419            1               2 0.424 0        0.00545 
5     2 CHST6  0.430            1               3 0.424 0.0109   0.00545 
6     3 ACE    0.634            1               1 0.634 0        0.000250
7     3 NOS2   0.634            1               1 0.634 0.000500 0.000250
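Note that AvgPerc averages the consecutive differences together with the leading 0, so Group 1 comes out as 0.180 rather than the 0.18365 worked out by hand in the question. If the mean over all gene pairs is what is wanted, one possible sketch uses `dist()`, which returns every pairwise distance within a group. The pairwise interpretation is an assumption, and the column names are taken from the question's expected output.

```r
library(dplyr)

# Stand-in for the example data (only the columns needed here)
df <- data.frame(
  Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
  Gene  = c("AQP11", "CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"),
  Score = c(0.5566507, 0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345)
)

newdf <- df %>%
  group_by(Group) %>%
  mutate(
    Average_Score = mean(Score, na.rm = TRUE),
    # dist() on a numeric vector gives |x_i - x_j| for every pair i < j
    Average_Score_Difference = mean(as.numeric(dist(Score)))
  ) %>%
  ungroup()
```

With this version, Group 1 gets about 0.18365 on every row. A group containing a single gene has no pairs, so its value would be NaN and may need separate handling.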

Some data used:

#Data
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 3L, 3L), Gene = c("AQP11", 
"CLNS1A", "RSF1", "CFDP1", "CHST6", "ACE", "NOS2"), Score = c(0.5566507, 
0.2811747, 0.5469924, 0.4186066, 0.4295135, 0.634, 0.6345), direct_count = c(4L, 
0L, 3L, 1L, 1L, 1L, 1L), secondary_count = c(5L, 2L, 6L, 2L, 
3L, 1L, 1L)), class = "data.frame", row.names = c(NA, -7L))

Upvotes: 2
