Reputation: 361
I have data of the form:
df <- data.frame(group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
                 thing = c(rep(c('a','b','c','d','e'),5)),
                 score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0))
which reports the "score" for each "thing" for a bunch of "group"s.
I would like to create a correlation matrix that shows, for every pair of "thing"s, the correlation of their scores across groups:
        thing_a thing_b thing_c thing_d thing_e
thing_a 1       .       .       .       .
thing_b corr    1       .       .       .
thing_c corr    corr    1       .       .
thing_d corr    corr    corr    1       .
thing_e corr    corr    corr    corr    1
For example, the data underlying the correlation between thing "a" and thing "b" would be:
group thing_a_score thing_b_score
1     1             1
2     1             1
3     1             1
4     0             1
5     0             1
In reality, there are ~1,000 unique groups and ~10,000 things, so I need an approach that is more efficient than a brute-force for-loop.
I don't need the resulting correlations to be in a single matrix, or even in a matrix per se (i.e., the result could be a set of data sets with three columns: "thing_1 thing_2 corr").
Upvotes: 1
Views: 829
Reputation: 6969
You can dcast your data first and then use the cor() function to get the correlation matrix:
library(data.table)
dt <- data.table(
  group = c(rep(1,5),rep(2,5),rep(3,5),rep(4,5),rep(5,5)),
  thing = c(rep(c('a','b','c','d','e'),5)),
  score = c(1,1,0,0,1,1,1,0,1,0,1,1,1,0,0,0,1,1,0,1,0,1,0,1,0)
)
dt
m <- dcast(dt, group ~ thing, value.var = "score")
cor(m[, -1])
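If the three-column "thing_1 thing_2 corr" layout mentioned in the question is preferred, the matrix can be flattened afterwards. The following is only a sketch under a few assumptions: it rebuilds the example data in base R, uses tapply() as a stand-in for dcast() so that no package is needed, and the names wide, cm, idx and pairs are made up for illustration. Note that thing "b" is constant in the example data, so cor() warns about a zero standard deviation and returns NA for every pair involving "b".

```r
# Rebuild the example data from the question (base R only).
df <- data.frame(
  group = rep(1:5, each = 5),
  thing = rep(c("a", "b", "c", "d", "e"), times = 5),
  score = c(1,1,0,0,1, 1,1,0,1,0, 1,1,1,0,0, 0,1,1,0,1, 0,1,0,1,0)
)

# Wide matrix: one row per group, one column per thing.
# tapply() is used here as a base R stand-in for dcast();
# each (group, thing) cell holds exactly one score, so sum() just returns it.
wide <- tapply(df$score, list(df$group, df$thing), sum)

# Thing "b" has zero variance, so cor() would warn and give NA for its pairs.
cm <- suppressWarnings(cor(wide))

# Keep each unordered pair once: lower triangle, diagonal excluded.
idx <- which(lower.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  thing_1 = rownames(cm)[idx[, "row"]],
  thing_2 = colnames(cm)[idx[, "col"]],
  corr    = cm[idx]
)
```

With 5 things this yields 10 rows; with ~10,000 things it would yield roughly 50 million rows, so the matrix form may well be the more compact of the two.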
data.table is usually performant, but if it does not work for you, please post a reproducible example that generates a large amount of data so that somebody can benchmark the speed and memory use of different solutions.
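A reproducible example of that kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the scores are simulated coin flips, the sizes are scaled down from the ~1,000 groups and ~10,000 things in the question so that it runs quickly, and the data is generated directly in wide form (rows are groups, columns are things) to isolate the cost of cor() itself.

```r
set.seed(1)
n_groups <- 200   # assumption: scaled down from ~1,000
n_things <- 500   # assumption: scaled down from ~10,000

# Simulated 0/1 scores in wide form: rows = groups, columns = things.
wide <- matrix(
  rbinom(n_groups * n_things, size = 1, prob = 0.5),
  nrow = n_groups,
  dimnames = list(NULL, paste0("thing_", seq_len(n_things)))
)

# Time a single cor() call, which computes all column pairs at once
# without an R-level loop.
timing <- system.time(cm <- cor(wide))

# Memory needed for a full 10,000 x 10,000 matrix of doubles, in GB.
full_size_gb <- 10000^2 * 8 / 1e9
```

Raising n_groups and n_things towards the real sizes and printing timing gives a baseline to compare other solutions against. The full-size correlation matrix alone takes about 0.8 GB, which is worth keeping in mind before asking for a long three-column result on top of it.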
Upvotes: 2