Reputation: 412
I have a large dataframe "Im" that looks like this
V1 V7 X134 X135 X136 X137 X138
1 m 1000 543.360 1057.770 1869.42 2664.06 3935.307
2 m 2000 767.256 1704.430 2993.63 5248.06 6341.129
3 m 3000 413.096 796.168 1441.13 3500.46 2962.048
4 a 4000 257.128 559.200 1014.79 2948.64 2080.437
5 a 5000 188.504 440.640 813.60 2538.11 1639.349
6 a 6000 483.704 921.064 1679.98 3626.44 3426.709
....
I want to find the correlation coefficient between column X135 and all the other numbered columns, so in other words, I essentially need this
> cor(Im$X135,Im$X136)
> cor(Im$X135,Im$X134)
> cor(Im$X135,Im$X137)
The problem is I need the correlation grouped by "V1". The output I am looking for should be something like this (using hypothetical correlation coefficients)
V1 cc134 cc136 cc137
1 m 0.92 0.99 0.95
1 a 0.99 0.93 0.89
I have looked at tapply, ddply, aggregate and everything I found was for row wise functions like sum and average, resulting in an output which has the same columns. I am new to R so couldn't figure out how to write a clever function to do this. I have considered reshaping the data, but didn't get anywhere with that either. Any and all help appreciated!
Upvotes: 1
Views: 8362
Reputation: 3280
This type of problem is best suited by data.table
package.
Here is a simple attempt:
library(data.table)
data <- read.table(text=" V1 V7 134 135 136 137 138
1 m 1000 543.360 1057.770 1869.42 2664.06 3935.307
2 m 2000 767.256 1704.430 2993.63 5248.06 6341.129
3 m 3000 413.096 796.168 1441.13 3500.46 2962.048
4 a 4000 257.128 559.200 1014.79 2948.64 2080.437
5 a 5000 188.504 440.640 813.60 2538.11 1639.349
6 a 6000 483.704 921.064 1679.98 3626.44 3426.709",header=T)
data <- data.table(data)
setkey(data,V1)
data[,list(cc134=cor(X135,X134),cc136=cor(X135,X136),cc137=cor(X135,X137)),by=key(data)]
To learn more about the package:
vignette("datatable-intro")
vignette("datatable-faq")
vignette("datatable-timings")
Or see it in action:
example(data.table)
Upvotes: 3