Reputation: 376
I am very confused by my_input_fn() at https://colab.research.google.com/notebooks/mlcc/first_steps_with_tensor_flow.ipynb
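For reference, the input function in that notebook looks roughly like this (reproduced from memory, so details may differ slightly):

import numpy as np
import tensorflow as tf

def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
    # Convert pandas data into a dict of numpy arrays.
    features = {key: np.array(value) for key, value in dict(features).items()}

    # Construct a dataset, and configure batching/repeating.
    ds = tf.data.Dataset.from_tensor_slices((features, targets))
    ds = ds.batch(batch_size).repeat(num_epochs)

    # Shuffle the data, if specified.
    if shuffle:
        ds = ds.shuffle(buffer_size=10000)

    # Return the next batch of data.
    features, labels = ds.make_one_shot_iterator().get_next()
    return features, labels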
If shuffle = False, won't estimator.train() use the same subset of data on every pass through the loop? i.e., loop iterations #0 and #1 would use the same subset of data.
The goal here is to:
* call estimator.train() in a loop
* evaluate the validation error within the loop
* have train() and evaluation use a different subset of the data on each loop iteration
From the runtime debug messages, it looks like the input function is called each time train() is called; e.g., if the loop count is 10, input_fn() is called 10 times. Since input_fn sets up the dataset from scratch each time (re-initializing the tf.data.Dataset), evaluation is done on the same subset of the dataset on each of the 10 calls. train() covers the whole set only because shuffle = True; if shuffle were False, train() would also be done on the same subset of the dataset on each of the 10 calls.
I understand that within a single train() call, it iterates through the tf.data.Dataset. But if train() is called again, it iterates through the same subset of the tf.data.Dataset as the previous call (assuming shuffle is False).
I looked at the docs. It seems that to feed different invocations of estimator.train() with different data, one has to create a new dataset for each call, e.g., use data rows 1-10000 to build the tf.data.Dataset for the 1st call of estimator.train(), then data rows 10001-20000 for the 2nd call.
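For illustration only (the slicing scheme and the loop variables below are my own placeholders, not code from the notebook):

rows_per_step = 10000
for i in range(num_loops):
    # Build a fresh input_fn over a different slice of rows for each call.
    start, end = i * rows_per_step, (i + 1) * rows_per_step
    feature_slice = my_feature_data[start:end]
    target_slice = targets[start:end]
    # Default arguments capture the current slices (avoids Python's
    # late-binding closure behavior inside the loop).
    train_input_fn = lambda fs=feature_slice, ts=target_slice: my_input_fn(
        fs, ts, batch_size=batch_size)
    estimator.train(input_fn=train_input_fn, steps=steps_per_loop)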
Is there a better way to feed tf.data.Dataset to different invocations of estimator.train() within a loop?
Thanks.
Upvotes: 0
Views: 290
Reputation: 963
No, it will not. tf.data.Dataset.batch() creates batches of size batch_size over the entire set and returns the next one every time the get_next() op is called.
From the documentation for batch(): Combines consecutive elements of this dataset into batches.
The tensors in the resulting element will have an additional outer dimension, which will be batch_size (or N % batch_size for the last element if batch_size does not divide the number of input elements N evenly and drop_remainder is False). If your program depends on the batches having the same outer dimension, you should set the drop_remainder argument to True to prevent the smaller batch from being produced.
.shuffle() only changes the ordering of the data points. If it is on, you will get different data points in each batch every time.
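A quick standalone way to see this (a TF 1.x sketch, not the notebook's code):

import tensorflow as tf

# batch() walks through the entire dataset; repeat() restarts it when
# exhausted, and shuffle() would only change the order of the elements.
ds = tf.data.Dataset.from_tensor_slices(tf.range(10))
ds = ds.batch(3).repeat()  # add .shuffle(buffer_size=10) before batch() to reorder
next_batch = ds.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    for _ in range(5):
        print(sess.run(next_batch))
# Prints [0 1 2], [3 4 5], [6 7 8], [9], [0 1 2]: successive get_next()
# calls advance through the whole set, not a fixed subset.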
Upvotes: 1
Reputation: 1012
If you look at your train_model function, you can see these 2 lines:
training_input_fn = lambda:my_input_fn(my_feature_data, targets, batch_size=batch_size)
prediction_input_fn = lambda: my_input_fn(my_feature_data, targets, num_epochs=1, shuffle=False)
If you set shuffle=False, you will get the same ordering of the data each time the function is called, which is exactly what you need for predictions, since you are computing the loss this way:
# Compute loss.
root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(predictions, targets))
You need the correct prediction for each corresponding label, so the ordering is important.
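For context, this is roughly how the notebook turns the predictions into an array before computing that error (from memory; the estimator name linear_regressor and the exact post-processing may differ):

import math
import numpy as np
from sklearn import metrics

# Because prediction_input_fn uses shuffle=False (and num_epochs=1), the
# i-th prediction lines up with the i-th row of `targets`, so the
# element-wise comparison below is meaningful.
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
predictions = np.array([item['predictions'][0] for item in predictions])

root_mean_squared_error = math.sqrt(
    metrics.mean_squared_error(predictions, targets))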
Upvotes: 0