Reputation: 21
I've been trying to set up a distributed cluster running the Boston Housing example from the TensorFlow tutorials, but so far I'm a bit lost; Googling and searching the tutorials hasn't helped.
"""DNNRegressor with custom input_fn for Housing dataset."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import itertools
import json
import os
import pandas as pd
import tensorflow as tf
tf.logging.set_verbosity(tf.logging.INFO)
COLUMNS = ["crim", "zn", "indus", "nox", "rm", "age",
"dis", "tax", "ptratio", "medv"]
FEATURES = ["crim", "zn", "indus", "nox", "rm",
"age", "dis", "tax", "ptratio"]
LABEL = "medv"
def input_fn(data_set):
feature_cols = {k: tf.constant(data_set[k].values) for k in FEATURES}
labels = tf.constant(data_set[LABEL].values)
return feature_cols, labels
def main(unused_argv):
# Load datasets
training_set = pd.read_csv("boston_train.csv", skipinitialspace=True,
skiprows=1, names=COLUMNS)
test_set = pd.read_csv("boston_test.csv", skipinitialspace=True,
skiprows=1, names=COLUMNS)
# Set of 6 examples for which to predict median house values
prediction_set = pd.read_csv("boston_predict.csv", skipinitialspace=True,
skiprows=1, names=COLUMNS)
# Feature cols
feature_cols = [tf.contrib.layers.real_valued_column(k)
for k in FEATURES]
cluster = {'ps': ['10.134.96.44:2222', '10.134.96.184:2222'],
'worker': ['10.134.96.37:2222', '10.134.96.145:2222']}
os.environ['TF_CONFIG'] = json.dumps(
{'cluster': cluster,
'task': {'type': 'worker', 'index': 0}})
# Build 2 layer fully connected DNN with 10, 10 units respectively.
regressor = tf.contrib.learn.DNNRegressor(feature_columns=feature_cols,
hidden_units=[10, 10],
model_dir="/tmp/boston_model",
config=tf.contrib.learn.RunConfig())
# Fit
regressor.fit(input_fn=lambda: input_fn(training_set), steps=5000)
# Score accuracy
ev = regressor.evaluate(input_fn=lambda: input_fn(test_set), steps=1)
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))
# Print out predictions
y = regressor.predict(input_fn=lambda: input_fn(prediction_set))
# .predict() returns an iterator; convert to a list and print predictions
predictions = list(itertools.islice(y, 6))
print("Predictions: {}".format(str(predictions)))
if __name__ == "__main__":
tf.app.run()
I'm not sure whether I've set up TF_CONFIG correctly here. I used a cluster of four machines - two parameter servers (ps) and two workers - but I didn't set 'environment' in the cluster, nor designate any 'master' machine. I started the two ps processes first, and then, when I ran the two workers, they got stuck right after "INFO:tensorflow:Create CheckpointSaverHook." Did I do anything wrong here?
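In case it's relevant, my intent was for every machine to share the same 'cluster' dict but get its own 'task' entry (the listing above shows the worker-0 case), with each index matching the machine's position in the corresponding cluster list:

# Intended 'task' entry per machine; the index must match the
# machine's position in the corresponding cluster list.
#   10.134.96.44:2222   -> {'type': 'ps',     'index': 0}
#   10.134.96.184:2222  -> {'type': 'ps',     'index': 1}
#   10.134.96.37:2222   -> {'type': 'worker', 'index': 0}
#   10.134.96.145:2222  -> {'type': 'worker', 'index': 1}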
I appreciate your help.
Upvotes: 2
Views: 816
Reputation: 69
I had the exact same problem. The issue is that the grpc server never actually gets started. I made the same assumption you did - that tf.learn starts the grpc server - but it does not. You can start the server from inside your Python script. Then, depending on whether the process is running a 'ps' or a 'worker' task, you either call server.join() or run the rest of your model's code:
import json
import os
import sys

import tensorflow as tf

job = sys.argv[1]        # 'ps' or 'worker'
task = int(sys.argv[2])  # index of this task within its job

cluster = {'worker': ['localhost:2223'],
           'ps': ['localhost:2222']}

os.environ['TF_CONFIG'] = json.dumps({'cluster': cluster,
                                      'task': {'type': job, 'index': task}})

# Create the server; this is the step tf.learn does not do for you
server = tf.train.Server(cluster,
                         job_name=job,
                         task_index=task)

if job == "ps":
  # Parameter servers only host variables; block here forever
  server.join()
elif job == "worker":
  # Load input
  # estimator.fit()
  pass
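You then launch one process per task, passing the job name and task index on the command line; for the two-task cluster above that would be something like python the_script.py ps 0 in one terminal and python the_script.py worker 0 in another (the script name here is just a placeholder).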
For more information, check out: how to run tensorflow distributed mnist example
and
https://www.tensorflow.org/deploy/distributed#putting-it-all-together-example-trainer-program
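If something else (e.g. a cluster manager) already sets TF_CONFIG for you, you can parse the same information back out of the environment instead of reading argv. A minimal sketch, assuming TF_CONFIG is already set in the form shown above:

import json
import os

import tensorflow as tf

# Recover the cluster layout and this process's task from TF_CONFIG
tf_config = json.loads(os.environ['TF_CONFIG'])
cluster = tf.train.ClusterSpec(tf_config['cluster'])
job = tf_config['task']['type']
task = tf_config['task']['index']

# Start the grpc server for this task
server = tf.train.Server(cluster, job_name=job, task_index=task)

if job == 'ps':
  server.join()
else:
  pass  # build the estimator and call fit() here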
Upvotes: 1