Reputation: 131
System information
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Ray installed from (source or binary): binary
Ray version: 0.7.3
Python version: 3.7
Tensorflow version: tensorflow-gpu 2.0.0rc0
Exact command to reproduce:
# Importing packages
from time import time
import gym
import tensorflow as tf
import ray

# Creating our initial model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')
])

# Setting parameters
episodes = 64
env_name = 'BipedalWalker-v2'

# Initializing ray
ray.init(num_cpus=8, num_gpus=1)

# Creating our ray function
@ray.remote
def play(weights):
    actor = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
        tf.keras.layers.Dense(4, activation='softmax')
    ])
    actor = actor.set_weights(weights)
    env = gym.make('BipedalWalker-v2').env
    env._max_episode_steps = 1e20
    obs = env.reset()
    for _ in range(1200):
        action = actor.predict_classes(obs).flatten()[0]
        action = env.action_space.sample()
        obs, rt, done, info = env.step(action)
    return rt

# Testing ray
start = time()
weights = model.get_weights()
weights = ray.put(weights)
results = ray.get([play.remote(weights) for i in range(episodes)])
ray.shutdown()
print('Ray done after:', time() - start)
Describe the problem
I am trying to use Ray to parallelize rollouts of OpenAI gym environments using a Tensorflow 2.0-gpu Keras actor. Every time I try to instantiate a Keras model inside a @ray.remote function, it raises a maximum-recursion-depth error. I am following the Ray documentation, which suggests passing weights rather than whole models. I am not sure what I am doing wrong here; any thoughts?
Source code / logs
File "/home/jacob/anaconda3/envs/tf-2.0-gpu/lib/python3.7/site-packages/tensorflow/init.py", line 50, in getattr module = self._load()
File "/home/jacob/anaconda3/envs/tf-2.0-gpu/lib/python3.7/site-packages/tensorflow/init.py", line 44, in _load module = _importlib.import_module(self.name)
RecursionError: maximum recursion depth exceeded
Upvotes: 2
Views: 1287
Reputation: 131
See the GitHub response to this issue: https://github.com/ray-project/ray/issues/5614
All that needs to be done is import tensorflow in the function definition:
@ray.remote
def play(weights):
    import tensorflow as tf
    import numpy as np  # needed for np.array below
    actor = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
        tf.keras.layers.Dense(4, activation='softmax')
    ])
    actor.set_weights(weights)
    env = gym.make('BipedalWalker-v2').env
    env._max_episode_steps = 1e20
    obs = env.reset()
    for _ in range(1200):
        action = actor.predict_classes(np.array([obs])).flatten()[0]
        action = env.action_space.sample()
        obs, rt, done, info = env.step(action)
    return rt
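The import then runs inside each worker process when the task executes, so cloudpickle never has to capture the tensorflow module while serializing the function and shipping it to the workers.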
Upvotes: 1
Reputation: 3372
The core problem appears to be that cloudpickle (which Ray uses to serialize remote functions and ship them to the worker processes) isn't able to pickle the tf.keras.Sequential class. For example, I can reproduce the issue as follows:
import cloudpickle  # cloudpickle.__version__ == '1.2.1'
import tensorflow as tf  # tf.__version__ == '2.0.0-rc0'

def f():
    tf.keras.Sequential

cloudpickle.loads(cloudpickle.dumps(f))  # This fails.
The last line fails with
---------------------------------------------------------------------------
RecursionError Traceback (most recent call last)
<ipython-input-23-25cc307e6227> in <module>
----> 1 cloudpickle.loads(cloudpickle.dumps(f))
~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
48
49 def __getattr__(self, item):
---> 50 module = self._load()
51 return getattr(module, item)
52
~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in _load(self)
42 def _load(self):
43 """Import the target module and insert it into the parent's namespace."""
---> 44 module = _importlib.import_module(self.__name__)
45 self._parent_module_globals[self._local_name] = module
46 self.__dict__.update(module.__dict__)
... last 2 frames repeated, from the frame below ...
~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
48
49 def __getattr__(self, item):
---> 50 module = self._load()
51 return getattr(module, item)
52
RecursionError: maximum recursion depth exceeded while calling a Python object
Interestingly, this succeeds with tensorflow==1.14.0, but I imagine keras has changed a ton in 2.0.
As a workaround, you can try defining f in a separate module or Python file, like
# helper_file.py
import tensorflow as tf

def f():
    tf.keras.Sequential
And then use it in your main script as follows.
import helper_file
import ray

ray.init(num_cpus=1)

@ray.remote
def use_f():
    helper_file.f()

ray.get(use_f.remote())
The difference here is that when cloudpickle tries to serialize use_f, it won't actually look at the contents of helper_file. When some worker process tries to deserialize use_f, that worker process will import helper_file. This extra indirection seems to make cloudpickle work more reliably. The same thing happens whenever you pickle a function that uses tensorflow or any other library: cloudpickle doesn't serialize the whole library, it just tells the deserializing process to import the relevant library.
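To illustrate that pickle-by-reference behavior, here is a minimal sketch (json is just a stand-in for any library the function refers to; it is not part of the original example):

import cloudpickle
import json  # stands in for any library the function refers to

def g(x):
    # g refers to the json module, so cloudpickle records a reference
    # to the module rather than serializing the module's contents
    return json.dumps(x)

payload = cloudpickle.dumps(g)         # contains no copy of json itself
restored = cloudpickle.loads(payload)  # loading re-imports json
print(restored({'a': 1}))              # prints {"a": 1}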
Note: For this to work on multiple machines, helper_file.py must exist and be on the Python path on each machine (one way to accomplish this is by installing it as a Python module on each machine).
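For example, a hypothetical minimal setup.py for that (the name and version here are illustrative, not from the original answer):

# setup.py -- hypothetical minimal packaging for helper_file.py
from setuptools import setup

setup(
    name='helper_file',
    version='0.1',
    py_modules=['helper_file'],  # installs helper_file.py onto the Python path
)

Running pip install . in that directory on each machine then makes import helper_file resolve everywhere.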
I verified that this seems to address the issue in your example. After making that fix, I ran into
File "<ipython-input-4-bb51dc74442c>", line 3, in play
File "/Users/rkn/Workspace/ray/helper_file.py", line 15, in play
action = actor.predict_classes(obs).flatten()[0]
AttributeError: 'NoneType' object has no attribute 'predict_classes'
but that looks like a separate issue.
Upvotes: 1