Variance of a complete group of a dataframe in R

Question

Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A). I'm trying to accomplish the following: given a grouping of the data frames based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable.

Let's say I do the grouping (in reality it's a cut in intervals):

df %>% group_by(A)

From what I have gathered trhoughout this site, you can use aggregate or other dplyr methods to calculate variance per column, i.e.: this (SO won't let me embed if I have <10 rep). In that picture we can see the grouping as colors, but by using aggregate I would get 1 standard deviation per specified column (I know you can use cbind to get more than 1 variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end).

However what I want is this: just like in Matlab when you do

group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group

I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.

Minimum working example:

df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")

df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...

Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).

Rui Barradas · Accepted Answer

If you want a base R solution, try the following.

sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633

I have coded it in two lines to make it clearer but you can avoid the creation of sp and write the one-liner

lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))

Or, for a result in another form,

sapply(sp, function(x) var(unlist(x)))
#(0.998,2]     (2,3] 
# 0.848707  1.806330

DATA

set.seed(6322)    # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")

Variance of a complete group of a dataframe in R

Answers (1)

Related Questions