Group-wise statistic using dplyr

Question

I have a dataset which looks somewhat like this-

Col1 Col2 Col3 Col4 Col5
400  322  345  1    1
131  345  809  1    1 
565  676  311  2    1
121  645  777  2    1
322  534  263  3    1
545  222  111  3    1

I want to perform a group-wise calculation where for each unique value in Col5, I calculate a statistic for Col1:Col3 grouping by Col4-

(X(i,j)-X'(i,j))/S(i)

where X(i,j) represents the mean of the variable for group i,j (Col5,Col4) ,X' represents the mean of the other groups j for the same variable, and S is the standard deviation over the entire group i. For example, in the above case, the statistic for Col1 based on group 1 in Col4 will be-

(mean(400,131)-mean(565,121,322,545))/stddev(Col1)
(265.5-388.25)/193.85 = -0.633

I want to use the summarise function with ddply to calculate this for each of the variables and for each of the groups in Col4 and Col5.

PS- I hope I've been able to explain the problem clearly.

Thanks!

akrun · Accepted Answer

One option after grouping by 'Col4' and 'Col5' would be to get the sum of the whole column i.e. 'Col1' , subtract it from the sum of the elements of 'Col1' in each group, divide by the difference of the number of rows of the whole dataset with the group number of rows (n()) to get the mean of elements not in the group. Get the difference from the mean of the group and divide by the standard deviation of the entire column.

 library(dplyr)
 df1 %>% 
    group_by(Col4, Col5) %>% 
    summarise(ColN = (mean(Col1)-((sum(df1$Col1) - 
                   sum(Col1)))/(nrow(df1)-n()))/sd(df1$Col1) )

If we need for Col1:Col3, use summarise_each

 df1 %>% 
    group_by(Col4, Col5) %>%
    summarise_each(funs(((mean(.) - ((sum(df1$.)-sum(.))/(nrow(df1)- 
                      n())))/sd(df1$.))), Col1:Col3)
#  Col4  Col5       Col1       Col2       Col3
#                    
#1     1     1 -0.6332145 -0.9922312  0.7342422
#2     2     1 -0.0335307  1.6279003  0.5623983
#3     3     1  0.6667452 -0.6356690 -1.2966405

Group-wise statistic using dplyr

Answers (2)

Related Questions