Alex

Reputation: 15708

Correlation of subsets of dataframe using aggregate

I have a data frame made by row-binding many data frames, each identified by a unique key. I want to calculate correlation coefficients between columns within each subset of the big data frame, where the subsets are defined by that key. For example, using the mtcars data, I might want the correlation between columns hp and wt for each unique value in column cyl. I could do it in a loop:

data("mtcars")
for (i in c(4, 6, 8)) {
  temp = subset(mtcars, cyl == i)
  # print() is needed: values are not auto-printed inside a for loop
  print(cor(temp$hp, temp$wt))
}

I think aggregate would be better, but this code doesn't work:

data("mtcars")
aggregate(mtcars, by=mtcars$cyl, cor)

Upvotes: 7

Views: 8907

Answers (2)

cryo111

Reputation: 4474

You could use

 data("mtcars")
 library(plyr)
 ddply(mtcars, "cyl", function(x) cor(x$hp, x$wt))

This splits mtcars by cyl, applies the function cor(x$hp, x$wt) to each subset x, and then combines the per-subset results into a single data.frame.
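
To make the three steps explicit, here is a minimal base R sketch of the same split, apply, combine sequence. It is only an illustration of the idea, not how plyr is implemented; the names pieces, cors and result are made up for this example, and the column name V1 is chosen to mirror the default name ddply gives an unnamed result.

```r
# Split: one data frame per distinct value of cyl
pieces <- split(mtcars, mtcars$cyl)

# Apply: correlation of hp and wt within each piece
cors <- vapply(pieces, function(x) cor(x$hp, x$wt), numeric(1))

# Combine: one row per group (V1 is an illustrative column name)
result <- data.frame(cyl = as.numeric(names(cors)), V1 = unname(cors))
result
```

The result should have one row per value of cyl, with the group key in the first column and the correlation in the second.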

I can highly recommend the plyr package. It's one of the packages I use most in R.


Edit: As requested, here is a dplyr version. I have to say that I am not a big dplyr user, but the code should be OK.

library(dplyr)
mtcars %>% group_by(cyl) %>% summarise(V1=cor(hp, wt))

Upvotes: 9

CHP

Reputation: 17189

In base R, this is a job for split combined with lapply or sapply:

lapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
## $`4`
## [1] 0.1598761
## 
## $`6`
## [1] -0.3062284
## 
## $`8`
## [1] 0.01761795
## 


sapply(split(mtcars, mtcars$cyl), function(X) cor(X$hp, X$wt))
##           4           6           8 
##  0.15987614 -0.30622844  0.01761795 
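
If more than one pair of columns is of interest, the same pattern can hand each subset to cor as a whole, which returns a correlation matrix per group. This is a sketch under the assumption that all selected columns are numeric; the choice of hp, wt and mpg and the names cols and cor_by_cyl are arbitrary and only for illustration.

```r
# Columns to correlate within each group (arbitrary illustrative choice)
cols <- c("hp", "wt", "mpg")

# One correlation matrix per value of cyl
cor_by_cyl <- lapply(split(mtcars[cols], mtcars$cyl), cor)

# Pull out a single coefficient, e.g. hp vs wt for the 4-cylinder cars
cor_by_cyl[["4"]]["hp", "wt"]
```

The last line should agree with the 4-cylinder value from the sapply call above.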

Upvotes: 10
