Reputation: 21
I'm trying to code several types of ANN algorithms in Python in order to get a better understanding/intuition of them. I'm not using Scikit-learn or any other ready-to-go packages, since my goal is educational rather than practical. As an example problem, I use the MNIST database (http://yann.lecun.com/exdb/mnist/).
While I implemented a simple 1-hidden-layer NN and a convolutional NN, I managed to avoid any second-order optimization methods and thus never computed the Hessian matrix. However, I then got to Bayesian NNs, where computing the Hessian is compulsory in order to optimize the hyperparameters.
In my fully connected network, there are 784 inputs, 300 hidden units, and 10 output units. That results in 238,200 weights (plus biases). When I try to compute or even approximate the Hessian (by the outer product of gradients), Python raises a MemoryError. Even when I decrease the number of weights to ~40,000, so that no error is raised, my computer hangs after several minutes. As I understand it, the problem is that the matrix in question is extremely large. I looked through a couple of articles on Bayesian NNs and noticed that the authors usually use architectures with no more than 10 or 20 inputs and hidden units, so they have far fewer parameters than I do. However, I have not seen any explicit statement of such a restriction.
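A quick back-of-the-envelope calculation (a sketch; the parameter count below is derived from the architecture described above) shows why the dense Hessian cannot fit in memory:

```python
# Estimate the memory needed to store a dense Hessian for the network above.
# 784 inputs -> 300 hidden -> 10 outputs, with biases on hidden and output units.
n_params = 784 * 300 + 300 * 10 + 300 + 10   # 238,510 parameters in total
bytes_per_float = 8                           # float64
hessian_bytes = n_params ** 2 * bytes_per_float
print(f"Dense Hessian: {hessian_bytes / 1e9:.0f} GB")  # roughly 455 GB
```

So even before any numerical issues, simply allocating the full Hessian requires hundreds of gigabytes, which explains the MemoryError.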
What can I do to apply the Bayesian approach to a NN for MNIST?
More generally: Is it possible to apply the Bayesian approach with this architecture (238,200 weights) or an even larger one? Or is it suitable only for relatively small networks?
Upvotes: 2
Views: 811
Reputation: 4329
You could try the BFGS algorithm (or its limited-memory variant, L-BFGS), which approximates the Hessian and tends to save considerable memory. There's an implementation in SciPy.
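A minimal sketch of what this looks like with SciPy (the quadratic objective here is a stand-in for a network's loss function, and the problem size is illustrative):

```python
# L-BFGS-B stores only a few recent gradient/update pairs instead of the
# full n x n Hessian, so memory grows linearly with the parameter count.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 1000  # stand-in for the number of network parameters

def loss(w):
    # Simple quadratic bowl with minimum at w = 1 (placeholder objective)
    return 0.5 * np.sum((w - 1.0) ** 2)

def grad(w):
    # Analytic gradient of the quadratic loss
    return w - 1.0

w0 = rng.standard_normal(n)
res = minimize(loss, w0, jac=grad, method="L-BFGS-B")
print(res.success)  # converges to w ~= 1 without ever forming the Hessian
```

For a real network you would pass your flattened weight vector and a function that returns the loss and its gradient; the full Hessian is never materialized.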
Upvotes: 2