Reputation: 15708
I have a data frame made by row binding many data frames, each identified with a unique key. I wish to calculate the correlation coefficients for columns in each subset (using the unique key) of the big data frame. For example, using the mtcars data I might want to calculate the correlation between columns hp
and wt
for each unique value in column cyl
. I could do it in a loop
data("mtcars")
for(i in c(4,6,8)){
temp = subset(mtcars,mtcars$cyl==i)
cor(temp$hp,temp$wt)
}
I think aggregate would be better, but this code doesn't work:
data("mtcars")
aggregate(mtcars,by=mycars$cyl,cor)
Upvotes: 7
Views: 8907
Reputation: 4474
You could use
data("mtcars")
library(plyr)
ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))
This splits the data in mtcars
by cyl
, applies for each subset x
the function cor(x$hp, x$wt)
and then aggregates the results for each of the subsets in a data.frame.
I can highly recommend the plyr
package. It's one of the packages I use most in R.
Edit: As per request, here a dplyr
version. I have to say that I am not a big dplyr
user, but the code should be ok.
library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))
Upvotes: 9
Reputation: 17189
In base
R, it's job for split
and lapply
or sapply
lapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## $`4`
## [1] 0.1598761
##
## $`6`
## [1] -0.3062284
##
## $`8`
## [1] 0.01761795
##
sapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## 4 6 8
## 0.15987614 -0.30622844 0.01761795
Upvotes: 10