Roi

Reputation: 121

Tensorflow high-level Estimator with input_fn from external file reader

[Short summary: how to use a TF high-level Estimator in Python with an external file reader? Or with feed_dict?]

I've been struggling with this for a few days and couldn't find any solution online...

I'm using the TF high-level modules (tf.contrib.learn.Estimator on TF 1.0, or tf.estimator.Estimator on TF 1.1), with features and targets (x/y) supplied through an input_fn and the graph built in the model_fn.

I've already trained a NN on 'small' data sets, in which the whole input is part of the graph, using slice_input_producer etc. (I can push an example to GitHub if it helps people here).
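
A minimal sketch of that "small data set" pattern (x_np / y_np are placeholder numpy arrays, not the real data) looks roughly like this:

import numpy as np
import tensorflow as tf

def small_data_input_fn():
    # The whole data set lives inside the graph as constants (fine for small data).
    all_x = tf.constant(x_np, dtype=tf.float32)  # x_np: full feature array (placeholder)
    all_y = tf.constant(y_np, dtype=tf.float32)  # y_np: full target array (placeholder)
    # slice_input_producer yields one example at a time from the in-graph tensors
    x, y = tf.train.slice_input_producer([all_x, all_y], shuffle=True)
    batch_x, batch_y = tf.train.batch([x, y], batch_size=32)
    return dict(data=batch_x), dict(labels=batch_y)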

Now I'm trying to train a larger NN on 'heavier' data sets (tens to hundreds of GB). I have an external Python reader that does some nasty binary file reading, which I really don't want to get into. This reader has its own queue.Queue with m1 samples. When I use it to extract the m1 {features} & {targets}, the net simply saves all these samples as constants in the first layer of the graph... completely undesired.

I'm trying to either:

  1. feed the output of the external file reader as input to my graph, or
  2. define a proper TF queue object that keeps being refilled (each time a sample is dequeued, I want a completely different sample to be enqueued); see the sketch after this list.
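
For option 2, a minimal sketch (assuming a hypothetical reader_next() helper, not shown here, that returns one (features, labels) pair of fixed-shape float32 numpy arrays) could wrap the reader with tf.py_func and keep a tf.FIFOQueue filled through a QueueRunner:

import numpy as np
import tensorflow as tf

def queued_input_fn():
    # Each call into the wrapped Python reader pulls a fresh sample.
    feats, labs = tf.py_func(func=reader_next, inp=[], Tout=[tf.float32, tf.float32], stateful=True)
    feats.set_shape([3, 3, 1])  # example shapes; adjust to the real data
    labs.set_shape([1, 1, 1])
    # A TF-side queue that a QueueRunner thread keeps refilling with new samples.
    queue = tf.FIFOQueue(capacity=64, dtypes=[tf.float32, tf.float32],
                         shapes=[[3, 3, 1], [1, 1, 1]])
    enqueue_op = queue.enqueue([feats, labs])
    tf.train.add_queue_runner(tf.train.QueueRunner(queue, [enqueue_op]))
    batch_feats, batch_labs = queue.dequeue_many(4)
    return dict(data=batch_feats), dict(labels=batch_labs)

(The answer below achieves option 1 more directly with tf.py_func alone, without an explicit queue.)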

As a reminder, I'm using the "high level" API, e.g.

self.Estimator = tf.contrib.learn.Estimator(
    model_fn=self.model_fn,
    model_dir=self.config['model_dir'],
    config=tf.contrib.learn.RunConfig( ... ) )

def input_fn(self, mode):
    batch_data = self.data[mode].next() # pops out a batch of samples, as numpy 4D matrices 
    ... # some processing of batch data 
    features_dict = dict(data=batch_data.pop('data'))
    targets_dict = batch_data
    return features_dict, targets_dict

self.Estimator.fit(input_fn=lambda: self.input_fn(modekeys.TRAIN))

Upvotes: 1

Views: 979

Answers (2)

Roi

Reputation: 121

Attached is a final solution for integrating an external reader into the high-level TF API (tf.contrib.learn.Estimator / tf.estimator.Estimator).

Please note:

  • The architecture and "logic" are not important; it's a deliberately simple net.
  • The external reader outputs a dictionary of numpy matrices.
  • The input_fn uses this reader.
  • In order to verify that the reader "pulls new values", I both
    • save the most recent value to self.status (should be > 1.0), and
    • save a summary, to be viewed in TensorBoard.

The code example is in a gist, and below.

import tensorflow as tf
import numpy as np
modekeys = tf.contrib.learn.ModeKeys
tf.logging.set_verbosity(tf.logging.DEBUG)
# Tested on python 2.7.9, tf 1.1.0

class inputExample:
    def __init__(self):
        self.status = 0.0 # tracing which value was recently 'pushed' to the net
        self.model_dir = 'temp_dir'
        self.get_estimator()

    def input_fn(self):
        # returns features and labels dictionaries as expected by tf Estimator's model_fn
        data, labels = tf.py_func(func=self.input_fn_np, inp=[], Tout=[tf.float32, tf.float32], stateful=True)
        data.set_shape([1,3,3,1]) # shapes are unknown and need to be set for integrating into the network
        labels.set_shape([1,1,1,1])
        return dict(data=data), dict(labels=labels)

    def input_fn_np(self):
        # returns a dictionary of numpy matrices
        batch_data = self.reader()
        return batch_data['data'], batch_data['labels']

    def model_fn(self, features, labels, mode):
        # using tf 2017 convention of dictionaries of features/labels as inputs
        features_in = features['data']
        labels_in = labels['labels']
        pred_layer = tf.layers.conv2d(name='pred', inputs=features_in, filters=1, kernel_size=3)
        tf.summary.scalar(name='label', tensor=tf.squeeze(labels_in))
        tf.summary.scalar(name='pred', tensor=tf.squeeze(pred_layer))
        loss = None
        if mode != modekeys.INFER:
            loss = tf.losses.mean_squared_error(labels=labels_in, predictions=pred_layer)
        train_op = None
        if mode == modekeys.TRAIN:
            train_op = tf.contrib.layers.optimize_loss(
                loss=loss,
                learning_rate = 0.01,
                optimizer = 'SGD',
                global_step = tf.contrib.framework.get_global_step()
            )
        predictions = {'estim_exp': pred_layer}
        return tf.contrib.learn.ModelFnOps(mode=mode, predictions=predictions, loss=loss, train_op=train_op)

    def reader(self):
        self.status += 1
        if self.status > 1000.0:
            self.status = 1.0
        return dict(
            data = np.random.randn(1,3,3,1).astype(dtype=np.float32),
            labels = np.sin(np.ones([1,1,1,1], dtype=np.float32)*self.status)
        )

    def get_estimator(self):
        self.Estimator = tf.contrib.learn.Estimator(
            model_fn = self.model_fn,
            model_dir = self.model_dir,
            config = tf.contrib.learn.RunConfig(
                save_checkpoints_steps = 10,
                save_summary_steps = 10,
                save_checkpoints_secs = None
            )
        )

if __name__ == '__main__':
    ex = inputExample()
    ex.Estimator.fit(input_fn=ex.input_fn)

Upvotes: 4

saeta

Reputation: 639

You can use tf.constant if you have the training data already in Python memory, as shown in the abalone TF example: https://github.com/tensorflow/tensorflow/blob/r1.1/tensorflow/examples/tutorials/estimators/abalone.py#L138-L141
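
A hedged sketch of that pattern, roughly following the linked abalone example (estimator and training_set are placeholders for an already-constructed Estimator and an in-memory data set):

import tensorflow as tf

def train_input_fn():
    # Training data is already in Python memory as numpy arrays (assumed).
    x = tf.constant(training_set.data)
    y = tf.constant(training_set.target)
    return x, y

estimator.fit(input_fn=train_input_fn, steps=1000)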

Note: copying the data from disk to Python to TensorFlow is often less efficient than constructing an input pipeline in TensorFlow (i.e. loading data from disk directly into TensorFlow Tensors), such as using tf.contrib.learn.datasets.base.load_csv_without_header.
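
For instance, the helper mentioned above loads a headerless CSV into numpy arrays that can then be wrapped into Tensors inside an input_fn (the path and dtypes below are placeholders):

import numpy as np
import tensorflow as tf

# Returns a Dataset namedtuple with .data and .target numpy arrays;
# by default the last CSV column is treated as the target.
training_set = tf.contrib.learn.datasets.base.load_csv_without_header(
    filename='train.csv',        # placeholder path
    target_dtype=np.int,
    features_dtype=np.float32)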

Upvotes: 0
