casparjespersen

Reputation: 3830

Convergence of LSTM network using Tensorflow

I am trying to detect micro-events in a long time series. For this purpose, I will train an LSTM network.

Data. The input for each time sample is 11 different features, roughly normalized to fit 0-1. The output is one of two classes.

Batching. Due to the huge class imbalance, I have extracted the data in batches of 60 time samples each, of which at least 5 will always be class 1 and the rest class 0. In this way the class imbalance is reduced from 150:1 to around 12:1. I have then randomized the order of all my batches.
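
A minimal sketch of this extraction, assuming features and labels are numpy arrays over the full series with 0/1 labels (the helper name and the non-overlapping stride are illustrative, not the exact code used):

import numpy as np

def extract_windows(features, labels, width=60, min_pos=5):
    # Collect contiguous windows that contain enough minority-class samples.
    windows = []
    for start in range(0, len(labels) - width + 1, width):
        win = slice(start, start + width)
        if labels[win].sum() >= min_pos:
            windows.append((features[win], labels[win]))
    np.random.shuffle(windows)  # randomize the batch order
    return windows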

Model. I am attempting to train an LSTM with an initial configuration of 3 different cells and 5 delay steps. I expect the micro-events to arrive in sequences of at least 3 time steps.

Problem: When I try to train the network, it quickly converges towards saying that EVERYTHING belongs to the majority class. When I implement a weighted loss function, at a certain threshold it changes to saying that EVERYTHING belongs to the minority class. I suspect (without being an expert) that there is no learning in my LSTM cells, or that my configuration is off.

Below is the code for my implementation. I am hoping that someone can tell me what I am doing wrong.

ar_model.py

import numpy as np
import tensorflow as tf
from tensorflow.models.rnn import rnn
import ar_config

config = ar_config.get_config()


class ARModel(object):

    def __init__(self, is_training=False, config=None):

        # Config
        if config is None:
            config = ar_config.get_config()

        # Placeholders
        self._features = tf.placeholder(tf.float32, [None, config.num_features], name='ModelInput')
        self._targets = tf.placeholder(tf.float32, [None, config.num_classes], name='ModelOutput')

        # Hidden layer
        with tf.variable_scope('lstm') as scope:
            lstm_cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
            cell = tf.nn.rnn_cell.MultiRNNCell([lstm_cell] * config.num_delays)
            self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
            outputs, state = rnn.rnn(cell, [self._features], dtype=tf.float32)

        # Output layer
        output = outputs[-1]
        softmax_w = tf.get_variable('softmax_w', [config.num_hidden, config.num_classes], tf.float32)
        softmax_b = tf.get_variable('softmax_b', [config.num_classes], tf.float32)
        logits = tf.matmul(output, softmax_w) + softmax_b

        # Evaluate
        ratio = (60.00 / 5.00)
        class_weights = tf.constant([ratio, 1 - ratio])
        weighted_logits = tf.mul(logits, class_weights)
        loss = tf.nn.softmax_cross_entropy_with_logits(weighted_logits, self._targets)
        self._cost = cost = tf.reduce_mean(loss)
        self._predict = tf.argmax(tf.nn.softmax(logits), 1)
        self._correct = tf.equal(tf.argmax(logits, 1), tf.argmax(self._targets, 1))
        self._accuracy = tf.reduce_mean(tf.cast(self._correct, tf.float32))
        self._final_state = state

        if not is_training:
            return

        # Optimize
        optimizer = tf.train.AdamOptimizer()
        self._train_op = optimizer.minimize(cost)


    @property
    def features(self):
        return self._features

    @property
    def targets(self):
        return self._targets

    @property
    def cost(self):
        return self._cost

    @property
    def accuracy(self):
        return self._accuracy

    @property
    def train_op(self):
        return self._train_op

    @property
    def predict(self):
        return self._predict

    @property
    def initial_state(self):
        return self._initial_state

    @property
    def final_state(self):
        return self._final_state

ar_train.py

import os
from datetime import datetime
import numpy as np
import tensorflow as tf
from tensorflow.python.platform import gfile
import ar_network
import ar_config
import ar_reader

config = ar_config.get_config()


def main(argv=None):

    if gfile.Exists(config.train_dir):
        gfile.DeleteRecursively(config.train_dir)
    gfile.MakeDirs(config.train_dir)

    train()

def train():
    train_data = ar_reader.ArousalData(config.train_data, num_steps=config.max_steps)
    test_data = ar_reader.ArousalData(config.test_data, num_steps=config.max_steps)

    with tf.Graph().as_default(), tf.Session() as session, tf.device('/cpu:0'):
        initializer = tf.random_uniform_initializer(minval=-0.1, maxval=0.1)

        with tf.variable_scope('model', reuse=False, initializer=initializer):
            m = ar_network.ARModel(is_training=True)
            s = tf.train.Saver(tf.all_variables())

        tf.initialize_all_variables().run()

        for batch_input, batch_target in train_data:
            step = train_data.iter_steps

            feed = {
                m.features: batch_input,
                m.targets: batch_target
            }

            session.run(m.train_op, feed_dict=feed)
            state, cost, accuracy = session.run([m.final_state, m.cost, m.accuracy], feed_dict=feed)

            if not step % 10:
                test_input, test_target = test_data.next()
                test_accuracy = session.run(m.accuracy, feed_dict={
                    m.features: test_input,
                    m.targets: test_target
                })
                now = datetime.now().time()
                print ('%s | Iter %4d | Loss= %.5f | Train= %.5f | Test= %.3f' % (now, step, cost, accuracy, test_accuracy))

            if not step % 1000:
                destination = os.path.join(config.train_dir, 'ar_model.ckpt')
                s.save(session, destination)

if __name__ == '__main__':
    tf.app.run()

ar_config.py

class Config(object):

    # Directories
    train_dir = '...'
    ckpt_dir = '...'
    train_data = '...'
    test_data = '...'

    # Data
    num_features = 13
    num_classes = 2
    batch_size = 60

    # Model
    num_hidden = 3
    num_delays = 5

    # Training
    max_steps = 100000


def get_config():
    return Config()

UPDATED ARCHITECTURE:

# Placeholders
self._features = tf.placeholder(tf.float32, [None, config.num_features, config.num_delays], name='ModelInput')
self._targets = tf.placeholder(tf.float32, [None, config.num_classes], name='ModelOutput')

# Weights
weights = {
    'hidden': tf.get_variable('w_hidden', [config.num_features, config.num_hidden], tf.float32),
    'out': tf.get_variable('w_out', [config.num_hidden, config.num_classes], tf.float32)
}
biases = {
    'hidden': tf.get_variable('b_hidden', [config.num_hidden], tf.float32),
    'out': tf.get_variable('b_out', [config.num_classes], tf.float32)
}

#Layer in
with tf.variable_scope('input_hidden') as scope:
    inputs = self._features
    inputs = tf.transpose(inputs, perm=[2, 0, 1])  # (BatchSize, NumFeatures, TimeSteps) -> (TimeSteps, BatchSize, NumFeatures)
    inputs = tf.reshape(inputs, shape=[-1, config.num_features])  # (TimeSteps, BatchSize, NumFeatures) -> (TimeSteps*BatchSize, NumFeatures)
    inputs = tf.add(tf.matmul(inputs, weights['hidden']), biases['hidden'])

#Layer hidden
with tf.variable_scope('hidden_hidden') as scope:
    inputs = tf.split(0, config.num_delays, inputs) # -> n_steps * (batchsize, features)
    cell = tf.nn.rnn_cell.BasicLSTMCell(config.num_hidden, forget_bias=0.0)
    self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
    outputs, state = rnn.rnn(cell, inputs, dtype=tf.float32)

#Layer out
with tf.variable_scope('hidden_output') as scope:
    output = outputs[-1]
    logits = tf.add(tf.matmul(output, weights['out']), biases['out'])

Upvotes: 1

Views: 5355

Answers (2)

Jon Gauthier

Reputation: 25572

Gunnar has already made lots of good suggestions. A few more small things worth paying attention to in general for this sort of architecture:

  • Try tweaking the Adam learning rate. You should determine the proper learning rate by cross-validation; as a rough start, you could just check whether a smaller learning rate saves your model from crashing on the training data (a combined sketch follows this list).
  • You should definitely use more hidden units. It's cheap to try larger networks when you first start out on a dataset. Go as large as necessary to avoid the underfitting you've observed. Later you can regularize / pare down the network after you get it to learn something useful.
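
As a minimal sketch of both points (the values 1e-4 and 128 are illustrative starting points, not tuned recommendations):

# In ar_config.py: try far more hidden units than 3.
num_hidden = 128

# In ARModel.__init__: pass an explicit, smaller learning rate to Adam.
optimizer = tf.train.AdamOptimizer(learning_rate=1e-4)
self._train_op = optimizer.minimize(cost)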

Concretely, how long are the sequences you are passing into the network? You say you have a 30k-long time sequence; I assume you are passing in subsections / samples of this sequence?

Upvotes: 1

Alexander R Johansen

Reputation: 2817

Odd elements

Weighted loss

I am not sure your "weighted loss" does what you want it to do:

    ratio = (60.00 / 5.00)
    class_weights = tf.constant([ratio, 1 - ratio])
    weighted_logits = tf.mul(logits, class_weights)

This is applied before the loss function is calculated, so it forces your predictions to behave in a certain way before the softmax is applied. (Further, I think you wanted an element-wise multiplication as well? Also, your ratio is above 1, which makes the second weight negative.)

If you want a weighted loss, you should instead apply the weighting after computing the unweighted cross-entropy:

loss = tf.nn.softmax_cross_entropy_with_logits(logits, self._targets)

with some element-wise multiplication of your weights:

loss = loss * weights

where your weights have a shape like [2,]. Note that loss has shape [batch_size], so each example's class weight has to be selected first; see the sketch below.
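
A minimal sketch of that per-example weighting, assuming one-hot targets (tf.reduce_sum picks out each example's class weight):

class_weights = tf.constant([1.0, ratio])  # per-class weights for class 0 and class 1
loss = tf.nn.softmax_cross_entropy_with_logits(logits, self._targets)  # shape: [batch_size]
example_weights = tf.reduce_sum(self._targets * class_weights, 1)      # shape: [batch_size]
self._cost = tf.reduce_mean(loss * example_weights)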

However, I would not recommend using weighted losses. Perhaps try increasing the ratio even further than 1:6.

Architecture

As far as I can tell, you are using 5 stacked LSTMs with 3 hidden units per layer?

Try removing the multi RNN and just using a single LSTM/GRU (maybe even just a vanilla RNN), and jack the hidden units up to ~100-1000 (a sketch follows).
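
A minimal sketch of that simplification, keeping the rest of the model unchanged (128 units is an illustrative choice; remember to update config.num_hidden so the softmax layer shapes still match):

with tf.variable_scope('lstm') as scope:
    # A single GRU layer with many more hidden units, instead of the stacked MultiRNNCell.
    cell = tf.nn.rnn_cell.GRUCell(128)
    self._initial_state = cell.zero_state(config.batch_size, dtype=tf.float32)
    outputs, state = rnn.rnn(cell, inputs, dtype=tf.float32)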

Debugging

Often when you are facing problems with an oddly behaving network, it can be a good idea to:

Print everything

Literally print the shapes and values of every tensor in your model: use session.run to fetch them and then print them. Your input data, the first hidden representation, your predictions, your losses, etc. (a sketch follows).
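
A minimal sketch, assuming the model exposes the tensors you care about as properties (m.logits is an assumption; the posted model would need e.g. self._logits = logits plus a matching property):

# Fetch intermediate tensors and inspect their shapes and values.
logits_val, cost_val = session.run([m.logits, m.cost], feed_dict=feed)
print('logits shape:', logits_val.shape)  # expect (batch_size, num_classes)
print('logits sample:', logits_val[:3])
print('mean loss:', cost_val)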

You can also use TensorFlow's tf.Print():

x_tensor = tf.Print(x_tensor, [tf.shape(x_tensor)])

Use TensorBoard

Using TensorBoard summaries on your gradients, accuracy metrics and histograms will reveal patterns in your data that might explain certain behavior, such as what led to exploding weights. Maybe your forget bias goes to infinity, or you're not tracking the gradient through a certain layer, etc. (a sketch follows).
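
A minimal sketch of wiring up summaries with the old tf.*_summary API used by the rest of this code (the summary names and writer directory are assumptions):

# In the model: record scalars and a weight histogram.
tf.scalar_summary('cost', self._cost)
tf.scalar_summary('accuracy', self._accuracy)
tf.histogram_summary('softmax_w', softmax_w)
self._summaries = tf.merge_all_summaries()  # expose via a property, e.g. m.summaries

# In the training loop: write them out periodically.
writer = tf.train.SummaryWriter(config.train_dir)
summary = session.run(m.summaries, feed_dict=feed)
writer.add_summary(summary, step)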

Other questions

  • How large is your dataset?

  • How long are your sequences?

  • Are the 13 features categorical or continuous? You should not normalize categorical variables or represent them as integers; instead, use one-hot encoding (see the sketch after this list).
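
A minimal sketch of one-hot encoding an integer-coded categorical column with numpy (the array values are made up for illustration):

import numpy as np

category = np.array([2, 0, 1, 2])  # integer category codes for one feature
one_hot = np.eye(3)[category]      # shape (4, 3); one row per sample, one column per category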

Upvotes: 3
