user48956

Reputation: 15794

Efficient calculation of the diagonal of the hat matrix: X inv(X'WX) X'

As part of a disease risk model I'm trying to implement a computation from a paper (in Python/numpy), part of which is the following matrix computation:

Q = X inv(X'WX) X'

where:

  • X is an (n × m) matrix
  • W is an (n × n) diagonal matrix of weights

Also, I only need the diagonal elements of Q as the output.

Is there some numpy matrix magic that allows me to calculate this efficiently?

Note:

The paper has an implementation in R which (I believe) does this as follows:

Qdiag <- lm.influence(lm(y ~ X-1, weights=W))$hat/W

R's docs for lm.influence say $hat gives "a vector containing the diagonal of the ‘hat’ matrix," which kinda sounds like what I want, although Wikipedia's definition of the hat matrix (a.k.a. influence or projection matrix) looks slightly different.
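If it helps, here's a small numpy check I put together (my own construction, not from the paper) of why dividing lm.influence's hat diagonal by the weights should give what I'm after:

```python
import numpy as np

# My own small check (not from the paper): for a weighted fit, R's
# lm.influence(...)$hat returns diag(H) with
#   H = sqrt(W) X inv(X'WX) X' sqrt(W),
# so dividing by the weights recovers diag(X inv(X'WX) X') -- i.e. Qdiag.
rng = np.random.default_rng(0)
n, m = 50, 3
X = rng.random((n, m))
w = rng.random(n)                      # diagonal of W as a vector

xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
Xs = np.sqrt(w)[:, None] * X           # sqrt(W) X

hat = np.einsum('ij,jk,ik->i', Xs, xtwx_inv, Xs)   # diag(H), the leverages
q_diag = np.einsum('ij,jk,ik->i', X, xtwx_inv, X)  # diag(Q)

print(np.allclose(hat / w, q_diag))    # True
```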

--

I think the following is a valid (naive) implementation. It runs out of memory for large n:

import numpy as np

m = 3
n = 20  # 500000 -- out of memory for large n

np.random.seed(123)
X = np.random.random((n, m))
W = np.random.random(n)
W = np.diag(W)                      # dense (n, n) matrix -- part of the memory problem
xtwx = X.T.dot(W.dot(X))            # X'WX, shape (m, m)
xtwxi = np.linalg.pinv(xtwx)        # inv(X'WX)
xtwxixt = xtwxi.dot(X.T)            # inv(X'WX) X'
Q = X.dot(xtwxixt)                  # full (n, n) product -- the other memory problem
Qdiag = np.diag(Q)
print(Qdiag.shape, Qdiag.sum())     # checksum of output
print(Qdiag)

Upvotes: 5

Views: 1458

Answers (1)

sascha
sascha

Reputation: 33512

(In a now-deleted comment I said it's impossible under some density and hardware assumptions, treating the formula as a black box. But it seems it can be done. That does not mean it's the correct approach!)

So without analyzing the background of this formula, we can make some basic improvements using minimal assumptions and classic rules like:

  • A: associativity of matrix multiplication
  • B: solving a system of linear equations instead of inverting
    • we assume XtWX is non-singular
  • C: recognizing that multiplying by a diagonal-only W reduces to an elementwise product with the diagonal vector (broadcast over the rows of X.T)
  • D: calculating only the diagonal entries of Q (otherwise the result matrix would have N*N = 2.5e8 entries for your numbers)
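To illustrate rule C on a toy example (my own sketch, not part of the solution below):

```python
import numpy as np

# Rule C in action: multiplying by a diagonal matrix never needs the
# dense n-by-n matrix. For W = diag(w):
#   X.T * w  scales the columns of X.T, exactly like  X.T @ np.diag(w)
rng = np.random.default_rng(1)
n, m = 6, 3
X = rng.random((n, m))
w = rng.random(n)

dense = X.T @ np.diag(w) @ X          # materializes an n-by-n diagonal matrix
cheap = (X.T * w) @ X                 # broadcasts w -- O(n) memory for the weights

print(np.allclose(dense, cheap))      # True
```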

Code:

import numpy as np
from time import perf_counter as pc     # python 3 only

m = 200
n = 500000

np.random.seed(123)
X = np.random.random((n,m))
W_diag = np.random.random(n)            # C -> dense vector

start_time = pc()

lhs = np.multiply(X.T, W_diag).dot(X)   # C (+A)
x = np.linalg.solve(lhs, X.T)           # B

# EDIT: Paul Panzer recommends the inverse in his comment based on the rhs-dims!

# if you know something about lhs (it looks symmetric; maybe even PSD),
# use one of the following for speedups and more robustness.
# I highly recommend researching this: it feels PSD
# import scipy.linalg as slin
# x = slin.solve(lhs, X.T, assume_a='sym')
# x = slin.solve(lhs, X.T, assume_a='pos')

Q_ = np.einsum('ij,ji->i', X,x)         # D most important -> N*N elsewise
print(Q_)

end_time = pc()
print(end_time - start_time)

Out:

[ 0.00068103  0.00083676  0.00080945 ...,  0.00077864  0.00078945
  0.0007804 ]
3.1077745566331165  # seconds

The result matches the output of your code for your single test case!
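As a small-n sanity check (my own addition), the einsum trick from rule D agrees with taking np.diag of the full n-by-n product:

```python
import numpy as np

# Sanity check on a small case: the einsum 'ij,ji->i' returns exactly
# the diagonal of X @ x without ever forming the n-by-n product.
rng = np.random.default_rng(2)
n, m = 8, 3
X = rng.random((n, m))
w = rng.random(n)

lhs = (X.T * w) @ X                   # X'WX via rule C
x = np.linalg.solve(lhs, X.T)         # rule B

full_diag = np.diag(X @ x)            # builds the whole n-by-n matrix first
einsum_diag = np.einsum('ij,ji->i', X, x)

print(np.allclose(full_diag, einsum_diag))   # True
```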

In general I would recommend working from the underlying math rather than from the extracted formula itself. Since the paper basically says the more complete problem is a weighted least-squares problem, that is a good starting point for some research.
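One math-driven alternative along those lines (a sketch under my own assumptions, not what the paper does): the leverages of a weighted least-squares fit are the squared row norms of the Q factor from a thin QR of sqrt(W) X, which avoids forming or inverting X'WX entirely:

```python
import numpy as np

# WLS view: with A = sqrt(W) X = QR (thin QR), the weighted hat matrix
# is H = A inv(A'A) A' = Q Q', so diag(H) is the squared row norms of Q,
# and diag(H) / w gives the diagonal of X inv(X'WX) X'.
rng = np.random.default_rng(3)
n, m = 1000, 5
X = rng.random((n, m))
w = rng.random(n)

Qf, _ = np.linalg.qr(np.sqrt(w)[:, None] * X)    # Qf is n-by-m ("thin" QR)
hat = np.einsum('ij,ij->i', Qf, Qf)              # squared row norms = diag(H)
q_diag = hat / w                                 # diag of X inv(X'WX) X'

# cross-check against the direct formula
xtwx_inv = np.linalg.inv(X.T @ (w[:, None] * X))
ref = np.einsum('ij,jk,ik->i', X, xtwx_inv, X)

print(np.allclose(q_diag, ref))                  # True
```

The QR route tends to be more numerically robust than solving with X'WX, at the cost of an n-by-m orthogonal factor.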

Upvotes: 3
