Reputation: 615
I have a large data frame. I want to calculate the correlation coefficient between hot and index, by class:
ID hot index class
41400 10 2 a
41400 12 2 a
41400 75 4 a
41401 89 5 a
41401 25 3 c
41401 100 6 c
20445 67 4 c
20445 89 6 c
20445 4 1 c
20443 67 5 d
20443 120.2 7 a
20443 140.5 8 d
20423 170.5 10 d
20423 78.1 5 c
Intended output
a = 0.X (assumed numbers)
b = 0.Y
c = 0.Z
I know I can use the by function, but I can't get it to work.
Code
cor_eqn = function(df){
m = cor(hot ~ index, df);
}
by(df,df$class,cor_eqn,simplify = TRUE)
Upvotes: 1
Views: 1614
Reputation: 7895
You can use dplyr for this:
library(dplyr)
gp = group_by(dataset, class)
correl = dplyr::summarise(gp, correl = cor(hot, index))
print(correl)
# class correl
# a 0.9815492
# c 0.9753372
# d 0.9924337
Note that class and df are also the names of R functions, so using them as column or variable names can cause trouble.
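For comparison, the by() attempt from the question can probably be made to work in base R as well. The sketch below assumes the sample rows from the question, stored under the hypothetical name `dat`. The main change is that cor() is given two numeric vectors, since it does not accept a formula.

```r
# Sample rows copied from the question; `dat` is an assumed name,
# chosen to avoid masking the df() function.
dat <- data.frame(
  ID    = c(41400, 41400, 41400, 41401, 41401, 41401, 20445,
            20445, 20445, 20443, 20443, 20443, 20423, 20423),
  hot   = c(10, 12, 75, 89, 25, 100, 67,
            89, 4, 67, 120.2, 140.5, 170.5, 78.1),
  index = c(2, 2, 4, 5, 3, 6, 4,
            6, 1, 5, 7, 8, 10, 5),
  class = c("a", "a", "a", "a", "c", "c", "c",
            "c", "c", "d", "a", "d", "d", "c")
)

# by() passes each per-class subset to the function as a data frame;
# cor() then receives the two columns as plain numeric vectors.
cor_by_class <- by(dat, dat$class, function(d) cor(d$hot, d$index))
print(cor_by_class)

# The same computation returned as a named numeric vector.
cor_vec <- sapply(split(dat, dat$class), function(d) cor(d$hot, d$index))
print(cor_vec)
```

The sapply/split form is often easier to work with afterwards, because the result is an ordinary named vector rather than a by object.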
Upvotes: 0
Reputation: 18602
Another option is to use a data.table instead of a data.frame. You can just call setDT(df) on your existing data.frame (I created a data.table initially below):
library(data.table)
##
set.seed(123)
DT <- data.table(
ID=1:50000,
class=rep(
letters[1:4],
each=12500),
hot=rnorm(50000),
index=rgamma(50000,shape=2))
## set key for better performance
## with large data set
setkey(DT,class)
##
> DT[,list(Correlation=cor(hot,index)),by=class]
class Correlation
1: a 0.005658200
2: b 0.001651747
3: c -0.002147164
4: d -0.006248392
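As a rough sketch of the setDT() route mentioned above, the following converts an existing data.frame in place and runs the same grouped call. It assumes the sample rows from the question, stored under the hypothetical name `dat`; the variable name and the `res` result name are illustrative only.

```r
library(data.table)

# Sample rows copied from the question; `dat` is an assumed name.
dat <- data.frame(
  ID    = c(41400, 41400, 41400, 41401, 41401, 41401, 20445,
            20445, 20445, 20443, 20443, 20443, 20423, 20423),
  hot   = c(10, 12, 75, 89, 25, 100, 67,
            89, 4, 67, 120.2, 140.5, 170.5, 78.1),
  index = c(2, 2, 4, 5, 3, 6, 4,
            6, 1, 5, 7, 8, 10, 5),
  class = c("a", "a", "a", "a", "c", "c", "c",
            "c", "c", "d", "a", "d", "d", "c")
)

# Convert the data.frame to a data.table by reference (no copy is made).
setDT(dat)

# One correlation per class, same pattern as the DT example above.
res <- dat[, list(Correlation = cor(hot, index)), by = class]
print(res)
```

Because setDT() modifies its argument by reference, `dat` is a data.table afterwards and there is no need to assign the result.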
Upvotes: 2