Reputation: 8049
I have no idea how to tackle this problem, the only thing I can think of is a brute force loop, but I'm not even sure how to loop through the rows of a data.table
in a sensible way.
I have a double keyed data.table
and a correlation matrix based on the first of those keys. I need to build the full correlation matrix for all elements, by looking up the correlation for any given pair, which is zero if the second key doesn't match.
Simplified Example:
library(data.table)
DT = data.table(Key1 = c("A", "A", "A", "B", "B", "C", "C"), Key2 = c(1,2,3,2,3,3,4), OtherData = "Irrelevant")
setkey(DT, Key2, Key1)
M = matrix(c(1.0, 0.4, 0.3,
0.4, 1.0, 0.2,
0.3, 0.2, 1.0), nrow = 3)
So our starting data.table looks like:
> DT
Key1 Key2 OtherData
1: A 1 Irrelevant
2: A 2 Irrelevant
3: B 2 Irrelevant
4: A 3 Irrelevant
5: B 3 Irrelevant
6: C 3 Irrelevant
7: C 4 Irrelevant
And the pre-defined correlation matrix for the A, B & C when they share the same Key2 value, is given by M:
> M
[,1] [,2] [,3]
[1,] 1.0 0.4 0.3
[2,] 0.4 1.0 0.2
[3,] 0.3 0.2 1.0
And I now need to make a 7x7 matrix that would look like:
> result
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1.0 0 0 0 0 0 0
[2,] 0 1.0 0.4 0 0 0 0
[3,] 0 0.4 1.0 0 0 0 0
[4,] 0 0 0 1.0 0.4 0.3 0
[5,] 0 0 0 0.4 1.0 0.2 0
[6,] 0 0 0 0.3 0.2 1.0 0
[7,] 0 0 0 0 0 0 1.0
Where we have created the block diagonal matrix using the parts of M that match the Key1 values available at each Key2 (Key2 is effectively time).
Upvotes: 3
Views: 396
Reputation: 118799
Here's one way (not sure how it scales though):
rownames(M) <- colnames(M) <- LETTERS[1:3]
ans <- DT[, list(idx1=.I, idx2=rep(.I, each=.N),
val=as.vector(M[Key1, Key1])), by=Key2]
dcast.data.table(ans, idx2 ~ idx1, value.var="val", fill=0L)
# idx2 1 2 3 4 5 6 7
# 1: 1 1 0.0 0.0 0.0 0.0 0.0 0
# 2: 2 0 1.0 0.4 0.0 0.0 0.0 0
# 3: 3 0 0.4 1.0 0.0 0.0 0.0 0
# 4: 4 0 0.0 0.0 1.0 0.4 0.3 0
# 5: 5 0 0.0 0.0 0.4 1.0 0.2 0
# 6: 6 0 0.0 0.0 0.3 0.2 1.0 0
# 7: 7 0 0.0 0.0 0.0 0.0 0.0 1
dcast.data.table
is available from data.table versions >= 1.9.0
. The current stable CRAN version at the time of writing is 1.9.2.
Upvotes: 5
Reputation: 8691
This is edited to use native data.table()
features - hopefully it should perform better!
# make the cor matrix into an expand.grid equivalent - all combos - using CJ for cross join
cor_list<-data.table(CJ(LETTERS[1:nrow(M)],LETTERS[1:nrow(M)]))
# fill with the values for M
cor_list[,cor:=unlist(as.list(M))]
# index on combination of correlation inputs
setkey(cor_list, V1, V2)
# lookup correlation for all combos of DT v DT
DTX<-DT[,cor_list[J(Key1,DT[,Key1],DT[,Key2])],by=c("Key1","Key2")]
# index on Key2
setkey(DTX,Key2)
# Set cor=0 where Key2 doesn't match (OK, it's a bit of a hack!)
DTX[Key2!=V3,cor:=0]
# fill a matrix with the vector of correlations (it fits)
# original length of DT gives you the length of side
matrix(DTX[,cor],nrow(DT))
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 0.0 0.0 0.0 0.0 0.0 0
[2,] 0 1.0 0.4 0.0 0.0 0.0 0
[3,] 0 0.4 1.0 0.0 0.0 0.0 0
[4,] 0 0.0 0.0 1.0 0.4 0.3 0
[5,] 0 0.0 0.0 0.4 1.0 0.2 0
[6,] 0 0.0 0.0 0.3 0.2 1.0 0
[7,] 0 0.0 0.0 0.0 0.0 0.0 1
EDITED ABOVE - DOUBLE APPLY SLOW AS PER ROLAND'S COMMENT
How about this?
#function to return letter corresponding to number
lookup_letter<-function(let){match(let,matrix(c("A","B","C")))}
then nest 2 apply calls for each dimension of the matrix
apply(DT,1,function(x){ # call row-wise
apply(DT,1,function(y)ifelse(y[2]==x[2],M[lookup_letter(x[1]),lookup_letter(y[1])],0)) # call column-wise lookup
})
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] 1 0.0 0.0 0.0 0.0 0.0 0
[2,] 0 1.0 0.4 0.0 0.0 0.0 0
[3,] 0 0.4 1.0 0.0 0.0 0.0 0
[4,] 0 0.0 0.0 1.0 0.4 0.3 0
[5,] 0 0.0 0.0 0.4 1.0 0.2 0
[6,] 0 0.0 0.0 0.3 0.2 1.0 0
[7,] 0 0.0 0.0 0.0 0.0 0.0 1
Probably there are better ways to lookup your correlation number, but this gives you an idea (maybe flatten M into an indexed list)
Upvotes: 0
Reputation: 18323
My data.table
skills aren't that strong, but I came up with a solution that takes advantage of the indices, but only if I added the row numbers.
# DT$row<-1:nrow(DT) # No longer necessary.
# Add dimension names to matrix for convenience
rownames(M)<-colnames(M)<-c('A','B','C')
f<-function(k1,k2) {
# rows<-DT[.(k2)]$row
rows<-DT[.(k2),.I]$.I
ret<-rep(0,nrow(DT))
ret[rows]<-M[DT[.(k2)]$Key1,k1]
ret
}
mapply(f,DT$Key1,DT$Key2)
# A A B A B C C
# [1,] 1 0.0 0.0 0.0 0.0 0.0 0
# [2,] 0 1.0 0.4 0.0 0.0 0.0 0
# [3,] 0 0.4 1.0 0.0 0.0 0.0 0
# [4,] 0 0.0 0.0 1.0 0.4 0.3 0
# [5,] 0 0.0 0.0 0.4 1.0 0.2 0
# [6,] 0 0.0 0.0 0.3 0.2 1.0 0
# [7,] 0 0.0 0.0 0.0 0.0 0.0 1
This should be a little better in the sense that the indices will be called. More efficient solutions might take advantage of the known diagonal nature of the output matrix. I wonder if there is a way to do this without adding the row numbers? The comment below indicated one way of getting the row number, I have implemented it above.
Upvotes: 1
Reputation: 6534
This does what you want:
1.Set up the data.
DT <- data.frame(Key1 = c("A", "A", "B", "A", "B", "C", "C"), Key2 = c(1, 2, 2, 3, 3, 3, 4))
M <- matrix(c(1, 0.4, 0.3, 0.4, 1, 0.2, 0.3, 0.2, 1), nrow = 3)
2.Subset the matrix, grouping by Key2 (this returns a list).
BD <- by(DT, DT$Key2, function(df) {N = as.numeric(df$Key1); M[N,N]})
3.Construct a block-diagonal matrix.
library(magic)
do.call(adiag, BD)
Upvotes: 1