Demetri Pananos

Reputation: 7394

Calculating the standard error via matrix computations

When doing linear regression, the standard error for a predicted value is given by np.sqrt(x.T @ sigma @ x). Here, x is an array of covariates and sigma is a positive definite matrix (the covariance matrix of the regression coefficients).

In regression, we usually stack the covariate vectors row-wise in an array X, so that X[i, :] is the i-th observation to be predicted.

One way to estimate the standard error of the predictions in matrix form is then np.sqrt((X.T * (sigma @ X.T)).sum(0)). Is there a way to do this more... linear algebraically? In particular, can the element-wise multiplication and sum be avoided? There is nothing wrong with the approach I've presented; I'm just curious.
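For concreteness, a minimal sketch (synthetic data; the names X and sigma follow the setup above) checking that the batched expression reproduces the single-observation formula:

import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 3
X = rng.normal(size=(n, p))          # rows are observations
A = rng.normal(size=(p, p))
sigma = A @ A.T + p * np.eye(p)      # a positive definite covariance matrix

batched = np.sqrt((X.T * (sigma @ X.T)).sum(0))
looped = np.array([np.sqrt(x @ sigma @ x) for x in X])
assert np.allclose(batched, looped)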

Upvotes: 0

Views: 567

Answers (1)

user10289025


Do you mean writing the equation in terms of BLAS operations? In that case, yes, but it does not improve the speed. To see this, let us write it in index notation:

ij,jk,ik->i

Clearly you can perform a matrix multiplication for the first two arrays, which is highly optimised. After the matrix multiplication we are left with

ik,ik->i

So we need to find the optimal way to compute this. It is clearly a batched dot product: you can either do an element-wise product followed by a sum, or a batched dot. Since dot gets most of its speed from vectorization, you don't see any improvement (NumPy also uses vectorization when performing the sum and product).
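If you want to check which contraction order the optimizer actually picks, np.einsum_path reports it. A minimal sketch (the array sizes here are arbitrary):

import numpy as np

x = np.random.rand(200, 200)
y = np.random.rand(200, 200)

# For 'ij,jk,ik->i' the report shows which pairwise contraction is done
# first (a matrix-product-like step) and the estimated speedup over a
# naive three-operand contraction.
path, report = np.einsum_path('ij,jk,ik->i', x, y, x, optimize='optimal')
print(report)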

import numpy as np

x = np.random.rand(4000,4000)
y = np.random.rand(4000,4000)

%timeit (x.T*(y@x.T)).sum(0) ## the question's form: transpose, then sum over the first axis

%timeit np.einsum('ij,jk,ik->i',x,y,x,optimize=True) ## multiplication followed by batched dot; can be a bit slower due to the overhead of finding the optimal path

%timeit ((x@y)[:,None,:]@x[:,:,None]).ravel() ## multiplication followed by batched dot

%timeit ((x@y)*x).sum(1) # fewer cache misses, as the last dimension is contiguous in memory

# 1.15 s ± 53.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 1.07 s ± 52.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 1.03 s ± 64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# 1.02 s ± 40.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Clearly there is only a slight improvement, and that is mostly because you transpose and sum over the first dimension, which is not cache friendly. If you remove the transpose and sum over the last dimension, you get the same speed, as shown in the last method.
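Mapping this back to the original standard-error computation, a small sketch (the names X and sigma follow the question; the data here is synthetic) showing that summing over the last, contiguous axis gives the same result without the transpose:

import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))          # rows are observations
A = rng.normal(size=(p, p))
sigma = A @ A.T + p * np.eye(p)      # a positive definite covariance matrix

se_original = np.sqrt((X.T * (sigma @ X.T)).sum(0))   # the question's form
se_contiguous = np.sqrt(((X @ sigma) * X).sum(1))     # sum over the last axis

assert np.allclose(se_original, se_contiguous)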

Upvotes: 1
