rnorouzian

Reputation: 7517

Obtaining basic statistics on multiple variables and multiple groups

I want to calculate two basic statistics for my data below on the two variables y1 and y2.

First, for each group, I want to separately obtain variance * (n_of_group - 1) (e.g., for group == 1 the answer will be 6 on y1 and 2 on y2).

Second, for each group, I want to obtain covariance * (n_of_group - 1) between y1 and y2 (e.g., for group == 1 the answer will be 0).

I have tried the R code below, but how do I apply the * (n_of_group - 1) part to it?

P.S. n_of_group is simply the count() or n() of each group. My desired output is shown below.

z <- "group    y1    y2
1 1         2     3
2 1         3     4
3 1         5     4
4 1         2     5
5 2         4     8
6 2         5     6
7 2         6     7
8 3         7     6
9 3         8     7
10 3        10     8
11 3         9     5
12 3         7     6"

library(dplyr)

dat <- read.table(text = z, header = TRUE)

dat %>%
  group_by(group) %>%
  summarise(var1 = var(y1), var2 = var(y2)) # how to apply `* (n_of_group - 1)` to var1 & var2?

dat %>%
  group_by(group) %>%
  summarise(co = cov(y1, y2)) # how to apply `* (n_of_group - 1)` to co? What if `co` were more than one number?

Desired output (if we put the results above for each group in a 2x2 matrix):

group1 = matrix(c(6,0,0,2),2)   # The two repeated elements in the middle (0,0) are
                                # the second statistic; the other elements are the
                                # first statistic
group2 = matrix(c(2,-1,-1,2),2)
group3 = matrix(c(6.8,2.6,2.6,5.2),2)

Upvotes: 1

Views: 82

Answers (2)

akrun

Reputation: 886938

We can also use across():

library(dplyr)
dat %>% 
    group_by(group) %>%
    summarise(co = cov(y1, y2) * (n() - 1), 
       across(c(y1, y2), ~ var(.) * (n() - 1), 
             .names = 'var_{.col}'), .groups = 'drop')

Output:

# A tibble: 3 x 4
#  group    co var_y1 var_y2
#  <int> <dbl>  <dbl>  <dbl>
#1     1   0      6      2  
#2     2  -1      2      2  
#3     3   2.6    6.8    5.2

Alternatively, the group size n can be created first with add_count():

library(dplyr) # add_count() is exported by dplyr, not tibble
dat %>% 
   add_count(group) %>%
   group_by(group) %>%
   summarise(co = cov(y1, y2) * (first(n) - 1), 
   across(c(y1, y2), ~ var(.) * (first(n)- 1), 
             .names = 'var_{.col}'), .groups = 'drop')

Upvotes: 1

Jon Spring

Reputation: 66415

Is this what you want?

dat %>%
  group_by(group) %>%
  summarise(var1 = var(y1) * (n()-1), 
            var2 = var(y2) * (n()-1),
            co   = cov(y1, y2) * (n()-1))


# A tibble: 3 x 4
  group  var1  var2    co
* <int> <dbl> <dbl> <dbl>
1     1   6     2     0  
2     2   2     2    -1  
3     3   6.8   5.2   2.6
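
As a sanity check on why multiplying by (n - 1) is the right operation: var() and cov() in R divide by n - 1, so multiplying back gives the sum of squared deviations and the sum of cross-products. The following is a minimal base-R sketch for group 1 only; the names g1, n1, ss_y1, ss_y2 and sp_y1y2 are made up for this illustration and are not part of the original answer.

```r
# Rebuild the question's data as a plain data frame
dat <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  y1    = c(2, 3, 5, 2, 4, 5, 6, 7, 8, 10, 9, 7),
  y2    = c(3, 4, 4, 5, 8, 6, 7, 6, 7, 8, 5, 6)
)

g1 <- dat[dat$group == 1, ]   # rows of group 1 only
n1 <- nrow(g1)                # group size (4)

# Sums of squared deviations and of cross-products, computed by hand
ss_y1   <- sum((g1$y1 - mean(g1$y1))^2)
ss_y2   <- sum((g1$y2 - mean(g1$y2))^2)
sp_y1y2 <- sum((g1$y1 - mean(g1$y1)) * (g1$y2 - mean(g1$y2)))

# The same quantities via var()/cov() times (n - 1)
all.equal(ss_y1,   var(g1$y1) * (n1 - 1))
all.equal(ss_y2,   var(g1$y2) * (n1 - 1))
all.equal(sp_y1y2, cov(g1$y1, g1$y2) * (n1 - 1))
```

Here ss_y1, ss_y2 and sp_y1y2 come out as 6, 2 and 0, matching the first row of the table above.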

To output into separate matrices for each group:

dat %>%
  group_by(group) %>%
  summarise(var1 = var(y1) * (n()-1), 
            var2 = var(y2) * (n()-1),
            co   = cov(y1, y2) * (n()-1),
            co2  = co) %>%
  select(group, var1, co, co2, var2) -> a

split(a, a$group) -> a
lapply(a, function(x) { x["group"] <- NULL; x }) -> a
lapply(a, function(x) { matrix(x, nrow = 2, ncol = 2)})

$`1`
     [,1] [,2]
[1,] 6    0   
[2,] 0    2   

$`2`
     [,1] [,2]
[1,] 2    -1  
[2,] -1   2   

$`3`
     [,1] [,2]
[1,] 6.8  2.6 
[2,] 2.6  5.2 
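
Note that calling matrix() on a one-row tibble yields a list-matrix (hence the left-aligned printing above), not a numeric matrix. If numeric matrices are needed, one possible alternative, sketched here in base R rather than taken from the answer, is to call cov() on each group's two columns and scale by the group size minus one. The name scatter is an assumption made for this example.

```r
# Rebuild the question's data as a plain data frame
dat <- data.frame(
  group = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3),
  y1    = c(2, 3, 5, 2, 4, 5, 6, 7, 8, 10, 9, 7),
  y2    = c(3, 4, 4, 5, 8, 6, 7, 6, 7, 8, 5, 6)
)

# Split the y columns by group, then compute cov() * (n - 1) per group.
# cov() on a two-column data frame returns the full 2 x 2 covariance matrix,
# so variances land on the diagonal and the covariance off the diagonal.
scatter <- lapply(
  split(dat[c("y1", "y2")], dat$group),
  function(d) cov(d) * (nrow(d) - 1)
)

scatter[["1"]]   # numeric 2 x 2 matrix for group 1
```

Each element of scatter is a numeric matrix with y1 and y2 as row and column names, so it can be used directly in further matrix arithmetic.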

Upvotes: 1
