Reputation: 313
I have a dataframe:
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))
I would like to calculate the variance for each group defined by the column group, pooling the values of X1, X2 and X3 across that group's rows. So the variance of A would be the variance of the values in the first two rows, and so forth. Ideally, I would also like to compare the variance within a group with the variance of pairs of groups combined; the output would be laid out like a correlation matrix. Desired output:
Variance Table | A | B | C |
---|---|---|---|
A | var of A | var of A+B | var of A+C |
B | var of B+A | var of B | var of B+C |
C | var of C+A | var of C+B | var of C |
Here, var of A+B means the variance of the first three rows (the rows of groups A and B combined).
Upvotes: 1
Reputation: 160447
Perhaps this?
out <- outer(
  setNames(nm=unique(data$group)), setNames(nm=unique(data$group)),
  Vectorize(function(a, b) var(unlist(subset(data, group %in% c(a, b), select = -group))))
)
out
# A B C
# A 1.1626004 1.3823834 0.9368846
# B 1.3823834 0.8256655 0.8212769
# C 0.9368846 0.8212769 0.7254770
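As a side note on why Vectorize() appears above: outer() calls its function once with two full-length vectors rather than once per pair, so a function that expects a single a and a single b has to be wrapped. The following is only a sketch of the same computation written with explicit loops, which may make the logic easier to follow. The helper name pooled_var and the object out_loop are made up for illustration, and the data is the seeded version from the end of this answer.

```r
# Sketch: the same matrix built with explicit loops instead of outer() + Vectorize().
set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))

# pooled_var is a hypothetical helper: variance of all X values
# in the rows belonging to group a or group b.
pooled_var <- function(a, b) {
  var(unlist(subset(data, group %in% c(a, b), select = -group)))
}

# as.character() guards against group being a factor (the default before R 4.0).
groups <- as.character(unique(data$group))

# Pre-allocate a named matrix, then fill one cell per pair of groups.
out_loop <- matrix(NA_real_, nrow = length(groups), ncol = length(groups),
                   dimnames = list(groups, groups))
for (a in groups) {
  for (b in groups) {
    out_loop[a, b] <- pooled_var(a, b)
  }
}
out_loop
```

The diagonal holds the single-group variances, since c(a, b) contains one distinct group when a and b are equal.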
Breakdown/verification for "var of A+B":
vecA <- unname(unlist(subset(data, group == "A", select = X1:X3)))
vecA
# [1] 1.3709584 -0.5646982 -0.1061245 1.5115220 1.3048697 2.2866454
vecB <- unname(unlist(subset(data, group == "B", select = X1:X3)))
vecB
# [1] 0.36312841 -0.09465904 -1.38886070
var(c(vecA, vecB))
# [1] 1.382383
Note: the use of setNames(nm=.) is a trick that assigns names to themselves: see that setNames(nm=c("A","B")) is the same as setNames(c("A","B"),nm=c("A","B")), so when object= (first arg) is not provided, it uses nm= for both the objects and their names. Self-naming.
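A quick way to convince yourself of the self-naming behaviour is to compare the two forms directly. This is a small sketch; the object names x, y and m are arbitrary, and paste0 is used only as a stand-in function to show where outer() gets its row and column names.

```r
# Self-naming: with only nm= supplied, setNames() uses it as both values and names.
x <- setNames(nm = c("A", "B"))
y <- setNames(c("A", "B"), nm = c("A", "B"))
identical(x, y)
# [1] TRUE

# outer() takes its dimnames from the names of its two vector arguments.
m <- outer(x, x, paste0)
dimnames(m)
```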
One reason this is useful is that outer is going to preserve the names of its arguments in the column and row names. One may want (and expect) them to be in alphabetic order, but unique(.) returns values in the order of their first appearance:
unique(c("A","C","B","C"))
# [1] "A" "C" "B"
unique(c("A","B","C","C"))
# [1] "A" "B" "C"
So if data$group did not already have the first occurrence of each group in the expected order, then the column names (and therefore the order of the values) would change:
set.seed(2022)
data <- data[sample(5),]
data
# group X1 X2 X3
# 4 C 0.6328626 2.01842371 -0.2787888
# 3 B 0.3631284 -0.09465904 -1.3888607
# 2 A -0.5646982 1.51152200 2.2866454
# 1 A 1.3709584 -0.10612452 1.3048697
# 5 C 0.4042683 -0.06271410 -0.1333213
outer(
  setNames(nm=unique(data$group)), setNames(nm=unique(data$group)),
  Vectorize(function(a, b) var(unlist(subset(data, group %in% c(a, b), select = -group))))
)
# C B A
# C 0.7254770 0.8212769 0.9368846
# B 0.8212769 0.8256655 1.3823834
# A 0.9368846 1.3823834 1.1626004
In this case, the use of setNames(nm=.) is ensuring that regardless of the order of unique(data$group), there is no doubt which column/row pairings we're looking at.
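If you would rather have the rows and columns in alphabetical order no matter how the data frame is sorted, one option (a variation, not something the original code does) is to sort the unique groups before passing them to outer. A minimal sketch, where shuffled, grp and out_sorted are names chosen here for illustration:

```r
# Rebuild the seeded data and shuffle its rows, as in the example above.
set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))
set.seed(2022)
shuffled <- data[sample(5), ]

# Sort the group labels so dimnames no longer depend on row order.
grp <- sort(unique(as.character(shuffled$group)))

out_sorted <- outer(
  setNames(nm = grp), setNames(nm = grp),
  Vectorize(function(a, b) var(unlist(subset(shuffled, group %in% c(a, b), select = -group))))
)
out_sorted
```

The values are the same as before; only the arrangement of rows and columns changes.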
Starting data:
set.seed(42)
data <- data.frame(group = c('A', 'A', 'B', 'C', 'C'),
                   X1 = rnorm(5),
                   X2 = rnorm(5),
                   X3 = rnorm(5))
Upvotes: 2