Reputation: 15794
As part of a disease risk model I'm trying to implement a computation from a paper (in Python/numpy), part of which is the following matrix computation:
Q = X (X^T W X)^-1 X^T
where:
X is an n-by-m matrix, W is an n-by-n diagonal weight matrix, and n can be very large (up to 500,000) while m is small.
Also, I only need to get the diagonal elements of Q as the output.
Is there some numpy matrix magic that allows me to calculate this efficiently?
Note:
The paper has an implementation in R which I believe does this as follows:
Qdiag <- lm.influence(lm(y ~ X-1, weights=W))$hat/W
R's docs for lm.influence$hat say this gives "a vector containing the diagonal of the ‘hat’ matrix", which kinda sounds like what I want, although Wikipedia's definition of the hat matrix (== influence or projection matrix) looks slightly different.
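For what it's worth, the two definitions do line up. In weighted least squares the hat matrix is H = X(XᵀWX)⁻¹XᵀW, so diag(H)/w is exactly diag(Q), which is what the R one-liner exploits. A small sketch of that algebra (my own check, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.random((n, m))
w = rng.random(n)  # diagonal of W, stored as a vector

# core = (X^T W X)^-1, computed with W applied as a row-scaling
core = np.linalg.inv(X.T @ (w[:, None] * X))

# Weighted hat diagonal: h_i = w_i * x_i^T core x_i
H_diag = np.einsum('ij,jk,ik,i->i', X, core, X, w)

# diag(Q) with Q = X (X^T W X)^-1 X^T: q_i = x_i^T core x_i
Q_diag = np.einsum('ij,jk,ik->i', X, core, X)

# lm.influence(...)$hat / W corresponds to H_diag / w
assert np.allclose(H_diag / w, Q_diag)
print("hat/w equals diag(Q):", np.allclose(H_diag / w, Q_diag))
```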
--
I think the following is a valid (naive) implementation. It runs out of memory for large n:
import numpy as np

m = 3
n = 20  # 500000 -- out of memory for large n
np.random.seed(123)
X = np.random.random((n, m))
W = np.random.random(n)
W = np.diag(W)  # dense n x n matrix -- this is one source of the memory blow-up

xtwx = X.T.dot(W.dot(X))       # X^T W X  (m x m)
xtwxi = np.linalg.pinv(xtwx)   # (X^T W X)^-1
xtwxixt = xtwxi.dot(X.T)       # (X^T W X)^-1 X^T
Q = X.dot(xtwxixt)             # full n x n Q -- the other blow-up
Qdiag = np.diag(Q)

print(Qdiag.shape, Qdiag.sum())  # checksum of output
print(Qdiag)
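Just to quantify why this dies for large n (my own back-of-the-envelope, assuming float64): the dense W and the full Q are each n-by-n, and a single such matrix at n = 500,000 is already far beyond any machine's RAM.

```python
# Memory needed for ONE dense n x n float64 matrix (the naive code builds two: W and Q)
n = 500_000
bytes_per_float = 8
dense_nxn = n * n * bytes_per_float
print(f"one dense {n}x{n} float64 matrix: {dense_nxn / 1e12:.0f} TB")  # -> 2 TB
```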
Upvotes: 5
Views: 1458
Reputation: 33512
(In a now-deleted comment I said it's impossible given some density and hardware assumptions, treating it as a black box. But it seems it can be done. That does not mean it's the correct approach!)
So without analyzing the background of this formula, we can do some basic approaches given minimal assumptions and classic rules like:
(A) exploit associativity to pick a cheap multiplication order
(B) never form an explicit inverse: inv(A).dot(b) becomes solve(A, b)
(C) a diagonal matrix is just a vector: scale rows instead of doing a dense dot product
(D) if only the diagonal of a product is needed, compute only that diagonal
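Rule (D) is the one that avoids the n x n intermediate entirely: diag(A @ B) only needs the n inner products of matching rows and columns, which einsum expresses directly. A tiny sanity check of that identity (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((4, 3))
B = rng.random((3, 4))

full = np.diag(A @ B)               # forms the full 4x4 product first
lazy = np.einsum('ij,ji->i', A, B)  # computes only the 4 diagonal entries

assert np.allclose(full, lazy)
print(lazy)
```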
Code:
import numpy as np
from time import perf_counter as pc # python 3 only
m = 200
n = 500000
np.random.seed(123)
X = np.random.random((n,m))
W_diag = np.random.random(n) # C -> dense vector
start_time = pc()
lhs = np.multiply(X.T, W_diag).dot(X) # C (+A)
x = np.linalg.solve(lhs, X.T) # B
# EDIT: Paul Panzer recommends the explicit inverse in his comment, based on the rhs dims!
# If you know something about lhs (it looks symmetric; maybe even PSD),
# use one of the following for speedups and more robustness.
# I highly recommend researching this: it feels PSD.
# import scipy.linalg as slin
# x = slin.solve(lhs, X.T, assume_a='sym')
# x = slin.solve(lhs, X.T, assume_a='pos')
Q_ = np.einsum('ij,ji->i', X, x) # D most important -> n*n elements otherwise
print(Q_)
end_time = pc()
print(end_time - start_time)
Out:
[ 0.00068103 0.00083676 0.00080945 ..., 0.00077864 0.00078945
0.0007804 ]
3.1077745566331165 # seconds
The result is the same as your code's output for your single test case!
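For reference, a direct side-by-side of both implementations at the small n from the question (my own check):

```python
import numpy as np

m, n = 3, 20
np.random.seed(123)
X = np.random.random((n, m))
w = np.random.random(n)

# Naive version from the question: dense W, full n x n Q
Q = X @ np.linalg.pinv(X.T @ np.diag(w) @ X) @ X.T
naive = np.diag(Q)

# Memory-friendly version: diagonal W as row-scaling, solve, einsum diagonal
lhs = np.multiply(X.T, w).dot(X)
x = np.linalg.solve(lhs, X.T)
fast = np.einsum('ij,ji->i', X, x)

assert np.allclose(naive, fast)
print("match:", np.allclose(naive, fast))
```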
In general I would recommend going by the underlying math and not by the extracted formula itself. As the paper basically says, the more complete problem is a weighted least-squares problem, and that is a good starting point for some research.
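Following that advice, one numerically friendlier route (my sketch, not from the paper) goes through the thin QR factorization of sqrt(W)·X: if sqrt(W)·X = QR, the WLS hat diagonal is the squared row norms of the Q factor, and the quantity asked for is that divided by the weights. This avoids forming XᵀWX (and squaring its condition number) altogether:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 1000, 5
X = rng.random((n, m))
w = rng.random(n)

# Solve-based version, as in the answer above
lhs = (X.T * w) @ X
Q_diag = np.einsum('ij,ji->i', X, np.linalg.solve(lhs, X.T))

# QR-based version: sqrt(W) X = Q R  =>  hat h_i = ||row_i(Q)||^2, wanted q_i = h_i / w_i
A = np.sqrt(w)[:, None] * X
Qf, _ = np.linalg.qr(A, mode='reduced')
Q_diag_qr = np.einsum('ij,ij->i', Qf, Qf) / w

assert np.allclose(Q_diag, Q_diag_qr)
print("QR route matches:", np.allclose(Q_diag, Q_diag_qr))
```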
Upvotes: 3