Reputation: 260
I want to whiten the CIFAR10 dataset using ZCA. The input X_train
is of shape (40000, 32, 32, 3) where 40000 is the number of images, and 32x32x3 is the size of each image. I'm using the code from this answer for this purpose:
X_flat = np.reshape(X_train, (-1, 32*32*3))
# compute the covariance of the image data
cov = np.cov(X_flat, rowvar=True) # cov is (N, N)
# singular value decomposition
U,S,V = np.linalg.svd(cov) # U is (N, N), S is (N,)
# build the ZCA matrix
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
# transform the image data zca_matrix is (N,N)
zca = np.dot(zca_matrix, X_flat) # zca is (N, 3072)
However, at run time I encountered the following warning:
D:\toolkits.win\anaconda3-5.2.0\envs\dlwin36\lib\site- packages\ipykernel_launcher.py:8: RuntimeWarning: invalid value encountered in sqrt
So after I got the SVD output, I tried:
print(np.min(S)) # prints -1.7798217
Which is unexpected because S
can only have positive values. Also, the ZCA whitening result was not correct and it contained nan
values.
I tried reproducing this by re-running this same code a second time and this time I did not encounter any warnings or any negative S
values, but instead I got:
print(np.min(S)) # prints nan
Any idea for why this might have happened?
Update: Restarted the kernel to free up cpu and RAM resources, and tried running this code again. Again got the same warning for feeding in negative values to np.sqrt()
. Not sure if it helps but I've also attached the cpu and ram utilization figures:
Upvotes: 4
Views: 1377
Reputation: 6927
Here are a couple of ideas. I don't have your dataset so I can't be totally sure that these will fix your problem, but I'm confident enough to post this as an answer instead of a comment.
First. Your X_train
is 40'000 by 3072, where each row is a data vector, and each column is a variable or feature. You want the covariance matrix that is 3072 by 3072: pass in rowvar=False
to np.cov
.
I'm not really sure why the 40'000 by 40'000 covariance matrix's SVD is diverging. Assuming you have enough RAM to store the 12 GB covariance matrix, the one thing I can think of is numerical overflow, because you're perhaps not removing the mean of the data, as is expected by ZCA (and any other whitening technique)?
So second. Remove the mean: X_zeromean = X_flat - np.mean(X_flat, 0)
.
If you do these, then the final step has to be modified a tiny bit (to make dimensions line up). Here's a quick check using uniform random data:
import numpy as np
X_flat = np.random.rand(40000, 32*32*3)
X_zeromean = X_flat - np.mean(X_flat, 0)
cov = np.cov(X_zeromean, rowvar=False)
U,S,V = np.linalg.svd(cov)
epsilon = 1e-5
zca_matrix = np.dot(U, np.dot(np.diag(1.0/np.sqrt(S + epsilon)), U.T))
zca = np.dot(zca_matrix, X_zeromean.T) # <-- transpose needed here
As a sanity check np.cov(zca)
now is very close to the identity matrix, as desired (zca
will have flipped dimensions as the input).
(As a sidenote, this is a really expensive and numerically unstable way to whiten the data array: you don't need to compute the covariance and then take the SVD—you're doing twice the work. You can take the skinny SVD of the data matrix itself (np.linalg.svd
with the full_matrices=False
flag) and compute the whitening matrix directly from there, without ever evaluating the expensive outer product for the covariance matrix.)
Upvotes: 2