Reputation: 274
I'm new to machine learning and seek some help. I would like to train a network to predict the next values I expect as follows:
reference: [val1 val2 ... val15]
val = 0 if it doesn't exist, 1 if it does.
Input: [1 1 1 0 0 0 0 0 1 1 1 0 0 0 0]
Output: [1 1 1 0 0 0 0 0 1 1 1 0 0 1 1] (last two values appear)
So my neural network would have 15 inputs and 15 outputs
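Something like this is what I'm imagining, if it helps (a rough Keras sketch; the hidden-layer size, the sigmoid output, and the binary cross-entropy loss are just my guesses, not something I'm sure is right):

```python
import numpy as np
from tensorflow import keras

# 15 binary inputs -> 15 binary outputs
model = keras.Sequential([
    keras.Input(shape=(15,)),
    keras.layers.Dense(30, activation="relu"),      # hidden layer size chosen arbitrarily
    keras.layers.Dense(15, activation="sigmoid"),   # one 0/1 probability per value
])
model.compile(optimizer="adam", loss="binary_crossentropy")

x = np.array([[1,1,1,0,0,0,0,0,1,1,1,0,0,0,0]], dtype="float32")
y = np.array([[1,1,1,0,0,0,0,0,1,1,1,0,0,1,1]], dtype="float32")
model.fit(x, y, epochs=10, verbose=0)
```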
I would like to know if there is a better way to do this kind of prediction. Would my data need normalization as well?
Now the problem is, I don't have 15 values but 600,000 of them. Can a neural network handle tensors that big? I've also heard I would need twice that number of hidden-layer units.
Thanks a lot for your help, you machine learning experts!
Best
Upvotes: 2
Views: 1868
Reputation: 77910
This is not a problem for the concept of a neural network: the question is whether your computing configuration and framework implementation deliver the required memory. Since you haven't described your topology, there's not a lot we can do to help you scope this out. What do you have for node and weight counts? Each of those is at least a single-precision float (4 bytes). For instance, a direct FC (fully-connected) layer from 600K inputs to 600K outputs would give you (6e5)^2 weights, or 3.6e11 * 4 bytes => 1.44e12 bytes. Yes, that's pushing 1.5 terabytes just for that one layer's weights.
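A few lines of Python reproduce that back-of-envelope arithmetic (weights only, ignoring biases and optimizer state):

```python
# Memory for one 600K -> 600K fully-connected layer, 4-byte floats.
n = 600_000
weights = n * n                          # 3.6e11 weights
bytes_needed = weights * 4               # ~1.44e12 bytes
print(f"{bytes_needed / 1e12:.2f} TB")   # -> 1.44 TB
```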
You can get around some of this with the style of NN you choose. For instance, splitting into separate channels (say, 60 channels of 1000 features each) can give you significant memory savings, albeit at the cost of speed in training (more layers) and perhaps some accuracy (although crossover can fix a lot of that). Convolutions can also save you overall memory, again at the cost of training speed.
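As a rough illustration of the convolution point (the filter count and kernel size here are arbitrary), a Conv1D layer's weight count depends on the kernel size, not on the 600K input length:

```python
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(600_000, 1)),    # 600K values, 1 channel
    keras.layers.Conv1D(16, kernel_size=9, padding="same", activation="relu"),
])
model.summary()   # 9*1*16 + 16 = 160 parameters, vs (6e5)^2 for a 600K -> 600K FC
```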
600K => 4 => 600K
That clarification takes care of my main worries: you have 600,000 * 4 weights in each of two places, i.e. 4.8M weights, plus 600,000 + 4 + 600,000 = 1,200,004 node activations. That's roughly 6M total floats, which shouldn't stress the RAM of any modern general-purpose computer.
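For concreteness, here's that clarified topology as a Keras sketch (the layer sizes come from your description; the ReLU/sigmoid activations and the bias terms are my assumptions):

```python
from tensorflow import keras

# 600K => 4 => 600K
model = keras.Sequential([
    keras.Input(shape=(600_000,)),
    keras.layers.Dense(4, activation="relu"),           # 600,000*4 weights + 4 biases
    keras.layers.Dense(600_000, activation="sigmoid"),  # 4*600,000 weights + 600,000 biases
])
model.summary()   # ~5.4M trainable parameters (4.8M weights) -> ~22 MB in float32
```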
The channelling idea applies when you're trying to have a fatter connection between layers, such as 600K => 600K FC. In that case, you break the data up into smaller groups (usually just 2-12) and make a bunch of parallel fully-connected streams. For instance, you could take your input and make 10 streams, each of which is a 60K => 60K FC. In your next layer, you swap the organization, "dealing out" each set of 60K so that 1/10 of it goes into each of the next channels.
This way, each channelized stage has only 10 * 60K * 60K weights, 10% as many as the full 600K => 600K FC ... but now there are 3 layers of nodes, i.e. two such stages, so you carry 20% of the original weight count. Still, that's a 5x saving on the memory required for weights, which is where you have the combinatorial explosion.
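Here's a scaled-down Keras sketch of that layout, using 600 features instead of 600K so it actually builds; the split into 10 channels follows the numbers above, while the ReLU activations are illustrative assumptions:

```python
from tensorflow import keras

n_features, n_channels = 600, 10
group = n_features // n_channels                     # 60 features per channel

inputs = keras.Input(shape=(n_features,))
# First bank: one small FC per channel instead of one huge 600 -> 600 FC.
streams = [keras.layers.Dense(group, activation="relu")(
               inputs[:, i * group:(i + 1) * group])
           for i in range(n_channels)]

# "Deal out": each next-layer channel takes 1/10 of every stream, so the
# channels mix before the second bank of per-channel FC layers.
sub = group // n_channels                            # 6 features per slice
dealt = [keras.layers.Concatenate()(
             [s[:, i * sub:(i + 1) * sub] for s in streams])
         for i in range(n_channels)]
mixed = [keras.layers.Dense(group, activation="relu")(d) for d in dealt]

outputs = keras.layers.Concatenate()(mixed)          # back to 600 features
model = keras.Model(inputs, outputs)
model.summary()   # ~2 * 10 * 60 * 60 weights vs 600 * 600 for a single full FC
```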
Upvotes: 2