Reputation: 497
I have a dataset which looks somewhat like this-
Col1 Col2 Col3 Col4 Col5
400 322 345 1 1
131 345 809 1 1
565 676 311 2 1
121 645 777 2 1
322 534 263 3 1
545 222 111 3 1
I want to perform a group-wise calculation where for each unique value in Col5, I calculate a statistic for Col1:Col3 grouping by Col4-
(X(i,j)-X'(i,j))/S(i)
where X(i,j) represents the mean of the variable for group i,j (Col5,Col4) ,X' represents the mean of the other groups j for the same variable, and S is the standard deviation over the entire group i. For example, in the above case, the statistic for Col1 based on group 1 in Col4 will be-
(mean(400,131)-mean(565,121,322,545))/stddev(Col1)
(265.5-388.25)/193.85 = -0.633
I want to use the summarise function with ddply to calculate this for each of the variables and for each of the groups in Col4 and Col5.
PS- I hope I've been able to explain the problem clearly.
Thanks!
Upvotes: 1
Views: 365
Reputation: 23231
Just for fun, here's an answer equivalent to @akrun's but without using dplyr
:
a <- matrix(ncol=3, nrow=3)
n <- 1
for(i in unique(df$Col5)){
for(ii in unique(df$Col4[df$Col5 == i])){
a[n,1] <- i
a[n,2] <- ii
a[n,3] <- (mean(df$Col1[df$Col4 == ii]) - mean(df$Col1[!df$Col4 == ii])) / sd(df$Col1)
n <- n + 1
}
}
> a
[,1] [,2] [,3]
[1,] 1 1 -0.6332145
[2,] 1 2 -0.0335307
[3,] 1 3 0.6667452
This is for Col1. You can add the same thing for Col2 and Col3 with copy and paste.
Upvotes: 1
Reputation: 887911
One option after grouping by 'Col4' and 'Col5' would be to get the sum
of the whole column i.e. 'Col1' , subtract it from the sum
of the elements of 'Col1' in each group, divide by the difference of the number of rows of the whole dataset with the group number of rows (n()
) to get the mean
of elements not in the group. Get the difference from the mean
of the group and divide by the standard deviation of the entire column.
library(dplyr)
df1 %>%
group_by(Col4, Col5) %>%
summarise(ColN = (mean(Col1)-((sum(df1$Col1) -
sum(Col1)))/(nrow(df1)-n()))/sd(df1$Col1) )
If we need for Col1:Col3
, use summarise_each
df1 %>%
group_by(Col4, Col5) %>%
summarise_each(funs(((mean(.) - ((sum(df1$.)-sum(.))/(nrow(df1)-
n())))/sd(df1$.))), Col1:Col3)
# Col4 Col5 Col1 Col2 Col3
# <int> <int> <dbl> <dbl> <dbl>
#1 1 1 -0.6332145 -0.9922312 0.7342422
#2 2 1 -0.0335307 1.6279003 0.5623983
#3 3 1 0.6667452 -0.6356690 -1.2966405
Upvotes: 1