I've just started using TensorFlow to build LSTM networks for multiclass classification.
Given the structure of an RNN model (diagram not shown here), let's assume each node A represents a TensorFlow BasicLSTMCell.
According to some popular examples found online, the input for training is prepared as [batch_size, timeStep_size, feature_size].
Let's assume timeStep_size = 5, feature_size = 2, num_class = 4, given one training set (dummy data):
t =    t0   t1   t2   t3   t4
x = [ [1]  [2]  [2]  [5]  [2] ]
    [ [2]  [3]  [3]  [1]  [2] ]
y = [ [0]  [1]  [1]  [0]  [0] ]
    [ [1]  [0]  [0]  [0]  [0] ]
    [ [0]  [0]  [0]  [0]  [1] ]
    [ [0]  [0]  [0]  [1]  [0] ]
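For concreteness, here is a minimal sketch (plain NumPy, batch_size = 1 assumed) of how this dummy example maps onto the [batch_size, timeStep_size, feature_size] layout:

import numpy as np

timeStep_size, feature_size, num_class = 5, 2, 4

# x: [batch_size, timeStep_size, feature_size] -- 1 sample, 5 steps, 2 features per step
x = np.array([[[1, 2], [2, 3], [2, 3], [5, 1], [2, 2]]], dtype=np.float32)
print(x.shape)   # (1, 5, 2)

# y as consumed by the "popular usage" below: one one-hot label per sequence,
# i.e. only the label at the last step t4 (here class index 2)
y = np.array([[0, 0, 1, 0]], dtype=np.float32)
print(y.shape)   # (1, 4)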
According to the popular usage:
...
# 1-layer LSTM with n_hidden units.
rnn_cell = rnn.BasicLSTMCell(n_hidden)
# x here is a list of timeStep_size tensors, each of shape [batch_size, feature_size]
outputs, states = rnn.static_rnn(rnn_cell, x, dtype=tf.float32)
# generate prediction from the output at the last time step only
return tf.matmul(outputs[-1], weights['out']) + biases['out']
It seems to me that the training of the LSTM cell doesn't make use of all five labels in y (y at t0 through t3). Only y at time t4 is used for calculating the loss, by comparing it against outputs[-1].
Question 1: is it the case that the LSTM calculates/approximates y_t0 by itself, feeds it into t1 to calculate y_t1, and so on, until y_t4 is calculated?
If this is the case,
Question 2: what if y at t-1 is very important?
Example:
t =    t-1  t0   t1   t2   t3   t4
x = [ [1]  [2]  [2]  [2]  [2]  [2] ]
    [ [1]  [2]  [2]  [2]  [2]  [2] ]
y = [ [0]  [1]  [1]  [1]  [1]  [1] ]
    [ [1]  [0]  [0]  [0]  [0]  [0] ]
    [ [0]  [0]  [0]  [0]  [0]  [0] ]
    [ [0]  [0]  [0]  [0]  [0]  [0] ]
VS:
t =    t-1  t0   t1   t2   t3   t4
x = [ [3]  [2]  [2]  [2]  [2]  [2] ]
    [ [3]  [2]  [2]  [2]  [2]  [2] ]
y = [ [0]  [0]  [0]  [0]  [0]  [0] ]
    [ [0]  [0]  [0]  [0]  [0]  [0] ]
    [ [1]  [0]  [0]  [0]  [0]  [0] ]
    [ [0]  [1]  [1]  [1]  [1]  [1] ]
This means that even though the input features from t0 to t4 are the same, the outputs y are different, because the previous output (y at t-1) is different.
So how should this kind of situation be handled? How does TensorFlow set the output for t-1 when calculating the output at t0?
I've thought about increasing timeStep_size, but in the real case the sequences might be very long, so I'm a bit confused...
Any pointers are highly appreciated!
Thank you in advance.
================= UPDATE ===============================
Re: jdehesa, thanks again.
Some additional background: my intention is to classify a long series of x, like below:
t =    t0   t1   t2   t3   t4   t5   t6   t7   t8   t9   t10  t11  ...
x = [ [3]  [2]  [2]  [2]  [2]  [2]  [1]  [2]  [2]  [2]  [2]  [2]  ... ]
    [ [3]  [2]  [2]  [2]  [2]  [2]  [1]  [2]  [2]  [2]  [2]  [2]  ... ]
y = [  c3   c2   c2   c2   c2   c2   c1   c4   c4   c4   c4   c4   ... ]
Note: c1: class 1, c2: class 2, c3: class 3, c4: class 4.
The main source of confusion behind this post is that there are some known rules used for manual classification. Take the dummy data above for example, and assume there are rules like the following (sketched in code below):
if the previous feature x is class 3 ([3, 3]), then all following [2, 2] will be class 2 until it reaches class 1;
if the previous x is class 1 ([1, 1]), then all following [2, 2] will be class 4 until it reaches class 3.
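In code form, the kind of manual rule I have in mind looks roughly like this (plain Python; the function name and the fallback label are placeholders, just to make the rule explicit):

def classify_by_rule(sequence):
    """sequence: list of [f1, f2] feature pairs; returns one class label per step."""
    labels = []
    mode = None          # remembers whether the last marker seen was class 3 or class 1
    for f in sequence:
        if f == [3, 3]:
            label, mode = 3, 'after_3'
        elif f == [1, 1]:
            label, mode = 1, 'after_1'
        elif f == [2, 2] and mode == 'after_3':
            label = 2
        elif f == [2, 2] and mode == 'after_1':
            label = 4
        else:
            label = 0    # case not covered by the stated rules
        labels.append(label)
    return labels

# e.g. classify_by_rule([[3, 3], [2, 2], [2, 2], [1, 1], [2, 2]]) -> [3, 2, 2, 1, 4]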
In such a case, if the LSTM only sees a [5 x 2] feature vector (x) like the one from t1 to t4, the network will be completely lost as to whether to classify it as class 2 or class 4. So what I mean is that not only do the features of the 5 time steps matter, the output/label of the previous time step does too.
So, to restate the question: if the training set is now t1 to t5, then in addition to x [batch_size, t1:t5, 2], how do I involve the label/class y at t0 as well?
Below are my responses to your answer.
Consider that I use a GRU instead of an LSTM, where the cell output and the cell state are both represented by "h", as in understanding LSTM.
About the initial_state parameter: I just found that dynamic_rnn and static_rnn take this parameter, as you pointed out :D. If I were to solve the problem mentioned just now, could I assign the previous class/label (y at t0) to the initial_state parameter before training, instead of using zero_state?
I suddenly feel like I'm totally lost about the time span of LSTM memory. I've been thinking that the time span of the memory is limited by timeStep_size only: if timeStep_size = 5, the network can only recall up to 4 steps back, since in every training step we only feed a [5 x 2] x feature vector. Please correct me if I'm wrong.
Again, thank you so much.
================= ANSWER (jdehesa) =====================
LSTM cells, or RNN cells in general, have an internal state that gets updated after each time step is processed. Obviously, you cannot go infinitely back in time, so you have to start at some point. The general convention is to begin with a cell state full of zeros; in fact, RNN cells in TensorFlow have a zero_state method that returns this kind of state for each particular cell type and size. If you are not happy with that starting point (for example, because you have processed half a sequence and now you want to process the other half, picking up at the same state you were in), you can pass an initial_state parameter to tf.nn.dynamic_rnn.
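A minimal sketch of that (TF 1.x API; the batch size, shapes and n_hidden value are placeholders):

import tensorflow as tf

batch_size, time_steps, feature_size, n_hidden = 32, 5, 2, 64
x = tf.placeholder(tf.float32, [batch_size, time_steps, feature_size])

cell = tf.nn.rnn_cell.BasicLSTMCell(n_hidden)

# Default starting point: an all-zeros state for this cell type and size
init_state = cell.zero_state(batch_size, dtype=tf.float32)

# final_state can later be fed back as initial_state to continue the same sequence
outputs, final_state = tf.nn.dynamic_rnn(cell, x, initial_state=init_state)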
About the training, I'm not sure what the most popular usage of LSTM cells is, but that's entirely up to you. I work on a problem where I have a label per time sample, so my output is the same size as the input. However, in many cases you just want one label for the whole sequence (e.g. "this sentence is positive/negative"), so you just look at the last output. All the previous inputs are of course important too, because they define the last cell state that is used in combination with the last input to determine the final output. For example, if you take a sentence like "That's cool, man" and process it word by word, the last word "man" will probably not tell you much by itself about whether the sentence is positive or negative, but at that point the cell is in a state where it is pretty sure the sentence is positive (that is, it would take a clearly negative input afterwards to make it produce a "negative" output).
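To make the two set-ups concrete, here is a rough sketch (the weight names and shapes are placeholders; `outputs` stands in for the [batch_size, time_steps, n_hidden] tensor returned by tf.nn.dynamic_rnn):

import tensorflow as tf

batch_size, time_steps, n_hidden, num_class = 32, 5, 64, 4
# stand-in for the outputs returned by tf.nn.dynamic_rnn
outputs = tf.placeholder(tf.float32, [batch_size, time_steps, n_hidden])

W_out = tf.Variable(tf.random_normal([n_hidden, num_class]))
b_out = tf.Variable(tf.zeros([num_class]))

# one label for the whole sequence: use only the last output
logits_last = tf.matmul(outputs[:, -1, :], W_out) + b_out       # [batch_size, num_class]

# one label per time step: project every output
logits_all = tf.tensordot(outputs, W_out, axes=1) + b_out       # [batch_size, time_steps, num_class]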
I'm not sure what you mean about the t-1 thing... I mean, if your input starts at t0 and you never saw t-1, there is nothing you can do about that (e.g. if you only got the input "really like this food" but it turns out the whole original sentence was "not really like this food", you will just get it completely wrong). However, if you do have that input, the network will learn to take it into account if it really is important. The whole point of LSTM cells is that they are able to remember things from very far in the past (i.e. the effect of an input on the internal state can reach across a very long time span).
Update:
About your additional comments.
You can use whatever you want as the input state, of course. However, even with a GRU the internal state does not usually match the output label. Typically, you would use a sigmoid or softmax activation after the recurrent unit, which would then produce an output comparable to the labels.
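Roughly like this (a sketch with placeholder sizes; the point is that the GRU state has n_hidden dimensions, not num_class, so a projection plus softmax is what becomes comparable to the one-hot labels):

import tensorflow as tf

batch_size, time_steps, feature_size, n_hidden, num_class = 32, 5, 2, 64, 4
x = tf.placeholder(tf.float32, [batch_size, time_steps, feature_size])

cell = tf.nn.rnn_cell.GRUCell(n_hidden)
outputs, state = tf.nn.dynamic_rnn(cell, x, dtype=tf.float32)   # state: [batch_size, n_hidden]

# the state itself is not a class label; project and squash to get class probabilities
logits = tf.layers.dense(outputs[:, -1, :], num_class)
probs = tf.nn.softmax(logits)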
About time spans. It is correct that using inputs with a small time step will constrain the ability of the recurrent units to learn long-term dependencies (to find patterns in longer sequences). As I mentioned, you can "emulate" a longer time step if you feed the last state of the recurrent units as the initial state for the next run. But whether you do that or not, it is not exactly true that the LSTM unit will simply "not remember" things further back in the past. Even if you train with a time step of 5, if you then run the network on a sequence of size 100, the output for the last input will (potentially) be affected by all the 99 previous inputs; you simply will not be able to tell how much they affect it, because that is a case you did not have during training.
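A sketch of that "feed the last state back in" idea (placeholder names and sizes; a GRU keeps the example simple because its state is a single tensor):

import numpy as np
import tensorflow as tf

batch_size, chunk_len, feature_size, n_hidden = 1, 5, 2, 64
x = tf.placeholder(tf.float32, [batch_size, chunk_len, feature_size])
state_in = tf.placeholder(tf.float32, [batch_size, n_hidden])

cell = tf.nn.rnn_cell.GRUCell(n_hidden)
outputs, state_out = tf.nn.dynamic_rnn(cell, x, initial_state=state_in)

long_sequence = np.random.rand(batch_size, 20, feature_size).astype(np.float32)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    state = np.zeros((batch_size, n_hidden), np.float32)          # same as zero_state
    for chunk in np.split(long_sequence, 4, axis=1):              # four 5-step chunks
        out, state = sess.run([outputs, state_out],
                              feed_dict={x: chunk, state_in: state})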