How to calculate the row-variance of each group?

Question

I have a dataframe:

data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))

I would like to calculate the variance of each group defined by the group column, pooling all values in that group's rows. So the variance of A would be the variance of the values in the first two rows, and so forth. Ideally, I would also like to compare the variance within a group and between groups; the output would look like a correlation matrix. Desired output:

Variance Table	A	B	C
A	var of A	var of A+B	var of A+C
B	var of B+A	var of B	var of B+C
C	var of C+A	var of C+B	var of C

Here, var of A+B means the variance of all values in the first three rows (the rows of A and B combined).

r2evans · Accepted Answer

Perhaps this?

out <- outer(
  setNames(nm=unique(data$group)), setNames(nm=unique(data$group)),
  Vectorize(function(a, b) var(unlist(subset(data, group %in% c(a, b), select = -group))))
)
out
#           A         B         C
# A 1.1626004 1.3823834 0.9368846
# B 1.3823834 0.8256655 0.8212769
# C 0.9368846 0.8212769 0.7254770
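
For readers who find outer plus Vectorize opaque, here is a rough sketch of the same computation written as an explicit double loop. outer calls its function once with whole vectors of pairings, which is why the scalar function above has to be wrapped in Vectorize; the loop makes that pairing visible. The names groups, loop_out and pooled are my own illustrative choices, and the data is the set.seed(42) starting data given at the end of this answer.

```r
# Starting data from the end of the answer (set.seed(42)).
set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))

# as.character() guards against group being a factor (the default before R 4.0).
groups <- as.character(unique(data$group))

# Empty square matrix with one row and one column per group.
loop_out <- matrix(NA_real_, nrow = length(groups), ncol = length(groups),
                   dimnames = list(groups, groups))

for (a in groups) {
  for (b in groups) {
    # Pool every numeric value from the rows belonging to either group.
    pooled <- unlist(data[data$group %in% c(a, b), names(data) != "group"])
    loop_out[a, b] <- var(pooled)
  }
}
loop_out
```

The diagonal holds the within-group variances (a and b are the same group, so only that group's rows are pooled), and the matrix is symmetric because pooling A with B is the same as pooling B with A.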

Breakdown/verification: for "var A + B",

vecA <- unname(unlist(subset(data, group == "A", select = X1:X3)))
vecA
# [1]  1.3709584 -0.5646982 -0.1061245  1.5115220  1.3048697  2.2866454
vecB <- unname(unlist(subset(data, group == "B", select = X1:X3)))
vecB
# [1]  0.36312841 -0.09465904 -1.38886070
var(c(vecA, vecB))
# [1] 1.382383

Note: setNames(nm=.) is a trick that names a vector with its own values. setNames(nm=c("A","B")) is the same as setNames(c("A","B"), nm=c("A","B")): when object= (the first argument) is not provided, nm= is used for both the values and their names.
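
A small check of the self-naming behaviour, and of what it buys you in outer. This is only a sketch: paste0 stands in for the variance function so the pairing in each cell is easy to read, and the variable names are illustrative.

```r
groups <- c("A", "B", "C")

# Both forms give a character vector whose names equal its values.
self_named <- setNames(nm = groups)
explicit   <- setNames(groups, groups)
identical(self_named, explicit)

# Without names, outer() returns a matrix with no dimnames.
unnamed <- outer(groups, groups, paste0)
dimnames(unnamed)

# With self-named inputs, the names become the row and column names.
named <- outer(setNames(nm = groups), setNames(nm = groups), paste0)
dimnames(named)
named["A", "B"]
```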

One reason this is useful is that outer preserves the names of its arguments as the row and column names of the result. One may want (and expect) them to be in alphabetical order, but unique(.) does not sort: it returns values in order of first appearance.

unique(c("A","C","B","C"))
# [1] "A" "C" "B"
unique(c("A","B","C","C"))
# [1] "A" "B" "C"

So if data$group did not already have the first occurrence of each in the expected order, then the column names (and therefore order of the values) would change:

set.seed(2022)
data <- data[sample(5),]
data
#   group         X1          X2         X3
# 4     C  0.6328626  2.01842371 -0.2787888
# 3     B  0.3631284 -0.09465904 -1.3888607
# 2     A -0.5646982  1.51152200  2.2866454
# 1     A  1.3709584 -0.10612452  1.3048697
# 5     C  0.4042683 -0.06271410 -0.1333213
outer(
  setNames(nm=unique(data$group)), setNames(nm=unique(data$group)),
  Vectorize(function(a, b) var(unlist(subset(data, group %in% c(a, b), select = -group))))
)
#           C         B         A
# C 0.7254770 0.8212769 0.9368846
# B 0.8212769 0.8256655 1.3823834
# A 0.9368846 1.3823834 1.1626004

In this case, the use of setNames(nm=.) is ensuring that regardless of the order of unique(data$group), there is no doubt which column/row pairings we're looking at.
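
If you would rather have the rows and columns in alphabetical order no matter how the data is arranged, one option (my own suggestion, not something the question asked for) is to sort the unique groups before handing them to outer. A minimal sketch, assuming group is a character column; shuffled, groups and sorted_out are illustrative names, and sort() on character vectors follows the current locale's collation:

```r
# Rebuild the starting data, then shuffle the rows as in the example above.
set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))
set.seed(2022)
shuffled <- data[sample(5), ]

# Sort the group labels so the result does not depend on row order.
groups <- sort(unique(shuffled$group))

sorted_out <- outer(
  setNames(nm = groups), setNames(nm = groups),
  Vectorize(function(a, b) var(unlist(subset(shuffled, group %in% c(a, b), select = -group))))
)
sorted_out
```

The values are the same as before; only the arrangement of rows and columns changes.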

Starting data:

set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))
