Mridul Garg
Mridul Garg

Reputation: 497

Group-wise statistic using dplyr

I have a dataset which looks somewhat like this-

Col1 Col2 Col3 Col4 Col5
400  322  345  1    1
131  345  809  1    1 
565  676  311  2    1
121  645  777  2    1
322  534  263  3    1
545  222  111  3    1

I want to perform a group-wise calculation where for each unique value in Col5, I calculate a statistic for Col1:Col3 grouping by Col4-

(X(i,j)-X'(i,j))/S(i)

where X(i,j) represents the mean of the variable for group i,j (Col5,Col4) ,X' represents the mean of the other groups j for the same variable, and S is the standard deviation over the entire group i. For example, in the above case, the statistic for Col1 based on group 1 in Col4 will be-

(mean(400,131)-mean(565,121,322,545))/stddev(Col1)
(265.5-388.25)/193.85 = -0.633

I want to use the summarise function with ddply to calculate this for each of the variables and for each of the groups in Col4 and Col5.

PS- I hope I've been able to explain the problem clearly.

Thanks!

Upvotes: 1

Views: 365

Answers (2)

Hack-R
Hack-R

Reputation: 23231

Just for fun, here's an answer equivalent to @akrun's but without using dplyr:

a <- matrix(ncol=3, nrow=3)   
n <- 1
for(i in unique(df$Col5)){
 for(ii in unique(df$Col4[df$Col5 == i])){
  a[n,1] <- i
  a[n,2] <- ii
  a[n,3] <- (mean(df$Col1[df$Col4 == ii]) - mean(df$Col1[!df$Col4 == ii])) / sd(df$Col1)
  n      <- n + 1
 }
}

> a
     [,1] [,2]       [,3]
[1,]    1    1 -0.6332145
[2,]    1    2 -0.0335307
[3,]    1    3  0.6667452

This is for Col1. You can add the same thing for Col2 and Col3 with copy and paste.

Upvotes: 1

akrun
akrun

Reputation: 887911

One option after grouping by 'Col4' and 'Col5' would be to get the sum of the whole column i.e. 'Col1' , subtract it from the sum of the elements of 'Col1' in each group, divide by the difference of the number of rows of the whole dataset with the group number of rows (n()) to get the mean of elements not in the group. Get the difference from the mean of the group and divide by the standard deviation of the entire column.

 library(dplyr)
 df1 %>% 
    group_by(Col4, Col5) %>% 
    summarise(ColN = (mean(Col1)-((sum(df1$Col1) - 
                   sum(Col1)))/(nrow(df1)-n()))/sd(df1$Col1) )

If we need for Col1:Col3, use summarise_each

 df1 %>% 
    group_by(Col4, Col5) %>%
    summarise_each(funs(((mean(.) - ((sum(df1$.)-sum(.))/(nrow(df1)- 
                      n())))/sd(df1$.))), Col1:Col3)
#  Col4  Col5       Col1       Col2       Col3
# <int> <int>      <dbl>      <dbl>      <dbl>
#1     1     1 -0.6332145 -0.9922312  0.7342422
#2     2     1 -0.0335307  1.6279003  0.5623983
#3     3     1  0.6667452 -0.6356690 -1.2966405

Upvotes: 1

Related Questions