Reputation: 149
Apologies if this is a trivial question, or if I am approaching this problem from completely the wrong angle.
Say I have a dataset that looks like this:
[A, [a,b,c,d]], [B, [e,f,g]], [C, [i,j,k,l,m]], ...
Capital letters represent large data chunks, and lowercase letters smaller chunks. Each large chunk is associated with a variable number of small chunks.
Now, I need to train my network like this: each input datapoint is a pair of the form (big chunk, small chunk), associated with a target label.
(A,a) ----> label 1
(A,b) ----> label 2
(A,c) ----> label 3
(A,d) ----> label 4
(B,e) ----> label 5
(B,f) ----> label 6
...and so on.
As you can see, the big data chunks are re-used across multiple inputs.
I would like to know the best way to feed this dataset into TensorFlow.
Idea 1: Obviously, I could simply rearrange the dataset up front into a flat sequence of datapoints
(A,a),(A,b),(A,c),(A,d),(B,e),(B,f),...
But that would mean duplicating the large chunks, which is a waste of memory overall.
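As a minimal sketch of Idea 1 with tf.data (the chunk shapes and toy data below are placeholders, not my real data), note that every yielded pair carries its own copy of the big chunk:

import numpy as np
import tensorflow as tf

# Toy stand-in for the dataset above; the chunk shapes
# (1024 and 16 floats) are assumptions for illustration only.
data = [
    (np.random.rand(1024).astype("float32"),
     [np.random.rand(16).astype("float32") for _ in range(4)]),  # A
    (np.random.rand(1024).astype("float32"),
     [np.random.rand(16).astype("float32") for _ in range(3)]),  # B
]

def flat_pairs():
    label = 0
    for big, smalls in data:
        for small in smalls:
            yield (big, small), label  # the big chunk is repeated per pair
            label += 1

ds = tf.data.Dataset.from_generator(
    flat_pairs,
    output_signature=(
        (tf.TensorSpec(shape=(1024,), dtype=tf.float32),
         tf.TensorSpec(shape=(16,), dtype=tf.float32)),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)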
Idea 2: I could split the neural network into two sub-networks, like this:
Big chunk ----> Network 1 ----\
                               \
Small chunk -------------------+----> Network 2 ----> Output
This seems more efficient, and I suspect there is a way to share the Network 1 computation across multiple datapoints that use the same big chunk. But how do I tell TensorFlow to iterate over two dependent input datasets?
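For reference, here is a minimal sketch of the architecture I have in mind, using the Keras functional API (the layer sizes, chunk shapes, and num_labels below are placeholders):

import tensorflow as tf

num_labels = 7  # placeholder; use the real number of labels

big_in = tf.keras.Input(shape=(1024,), name="big_chunk")
small_in = tf.keras.Input(shape=(16,), name="small_chunk")

# "Network 1" processes the big chunk.
big_features = tf.keras.layers.Dense(64, activation="relu")(big_in)

# "Network 2" combines the big-chunk features with the small chunk.
merged = tf.keras.layers.Concatenate()([big_features, small_in])
hidden = tf.keras.layers.Dense(32, activation="relu")(merged)
output = tf.keras.layers.Dense(num_labels, activation="softmax")(hidden)

model = tf.keras.Model(inputs=[big_in, small_in], outputs=output)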
Upvotes: 1
Views: 72
Reputation: 196
You should group your data into batches and feed each batch to your neural network. This not only solves your problem, it also lets your training scale.
(A,a) ----> label 1
(A,b) ----> label 2
(A,c) ----> label 3
(A,d) ----> label 4
(B,e) ----> label 5
(B,f) ----> label 6
(C,i) ----> label 7
(C,j) ----> label 8
into
Batch 1: (A,a),(A,b),(B,e),(C,j),...
Batch 2: (A,c),(A,d),(C,i),(B,f),...
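A minimal sketch of this with tf.data (the arrays below are toy stand-ins; shapes and sizes are assumptions):

import numpy as np
import tensorflow as tf

# Toy stand-ins for the flattened (big, small) -> label pairs.
bigs = np.random.rand(8, 1024).astype("float32")
smalls = np.random.rand(8, 16).astype("float32")
labels = np.arange(8, dtype="int32")

ds = tf.data.Dataset.from_tensor_slices(((bigs, smalls), labels))

# Shuffling mixes pairs from different big chunks into each batch;
# batching groups several pairs per training step.
ds = ds.shuffle(buffer_size=8).batch(4)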
Apply your cost function. Choose an optimizer and start training your network.
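For example, with a two-input Keras model (the layer sizes here are placeholders), compiling and training on the batched dataset above looks like:

big_in = tf.keras.Input(shape=(1024,))
small_in = tf.keras.Input(shape=(16,))
x = tf.keras.layers.Dense(64, activation="relu")(big_in)
x = tf.keras.layers.Concatenate()([x, small_in])
out = tf.keras.layers.Dense(8, activation="softmax")(x)
model = tf.keras.Model([big_in, small_in], out)

model.compile(optimizer="adam",                        # the optimizer
              loss="sparse_categorical_crossentropy",  # the cost function
              metrics=["accuracy"])
model.fit(ds, epochs=10)  # trains on the batched dataset from above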
Upvotes: 1