J. Gursky

Reputation: 131

Ray Tensorflow-gpu 2.0 RecursionError

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04

Ray installed from (source or binary): binary

Ray version: 0.7.3

Python version: 3.7

Tensorflow version: tensorflow-gpu 2.0.0rc0

Exact command to reproduce:

# Importing packages
from time import time
import gym
import tensorflow as tf
import ray

# Creating our initial model    
model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
        tf.keras.layers.Dense(4, activation='softmax')
        ])

# Setting parameters
episodes = 64
env_name = 'BipedalWalker-v2'

# Initializing ray
ray.init(num_cpus=8, num_gpus=1)

# Creating our ray function
@ray.remote
def play(weights):
    actor = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
        tf.keras.layers.Dense(4, activation='softmax')
        ])
    actor = actor.set_weights(weights)
    env = gym.make('BipedalWalker-v2').env
    env._max_episode_steps=1e20
    obs = env.reset()
    for _ in range(1200):
        action = actor.predict_classes(obs).flatten()[0]
        action = env.action_space.sample()
        obs, rt, done, info = env.step(action)
    return rt

# Testing ray
start = time()
weights = model.get_weights()
weights = ray.put(weights)
results = ray.get([play.remote(weights) for i in range(episodes)])
ray.shutdown()
print('Ray done after:',time()-start)

Describe the problem

I am trying to use Ray to parallelize rollouts of OpenAI Gym environments using a TensorFlow 2.0-gpu Keras actor. Every time I try to instantiate a Keras model inside a function decorated with @ray.remote, it raises a "maximum recursion depth exceeded" error. I am following the Ray documentation, which suggests passing weights instead of models. I am not sure what I am doing wrong here; any thoughts?

Source code / logs

File "/home/jacob/anaconda3/envs/tf-2.0-gpu/lib/python3.7/site-packages/tensorflow/init.py", line 50, in getattr module = self._load()

File "/home/jacob/anaconda3/envs/tf-2.0-gpu/lib/python3.7/site-packages/tensorflow/init.py", line 44, in _load module = _importlib.import_module(self.name)

RecursionError: maximum recursion depth exceeded

Upvotes: 2

Views: 1287

Answers (2)

J. Gursky

Reputation: 131

See the GitHub response to this issue: https://github.com/ray-project/ray/issues/5614

All that needs to be done is to import tensorflow inside the function definition:

@ray.remote
def play(weights):
    # Importing inside the function means each Ray worker imports
    # TensorFlow itself instead of receiving it through cloudpickle
    import numpy as np
    import tensorflow as tf

    # Rebuild the actor on the worker and load the weights passed from the driver
    actor = tf.keras.Sequential([
        tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
        tf.keras.layers.Dense(4, activation='softmax')
        ])
    actor.set_weights(weights)

    # Step through the environment and return the reward from the last step
    env = gym.make('BipedalWalker-v2').env
    env._max_episode_steps = 1e20
    obs = env.reset()
    for _ in range(1200):
        action = actor.predict_classes(np.array([obs])).flatten()[0]
        action = env.action_space.sample()
        obs, rt, done, info = env.step(action)
    return rt

Upvotes: 1

Robert Nishihara

Reputation: 3372

The core problem appears to be that cloudpickle (which Ray uses to serialize remote functions and ship them to the worker processes) isn't able to pickle the tf.keras.Sequential class. For example, I can reproduce the issue as follows

import cloudpickle  # cloudpickle.__version__ == '1.2.1'
import tensorflow as tf  # tf.__version__ == '2.0.0-rc0'

def f():
    tf.keras.Sequential

cloudpickle.loads(cloudpickle.dumps(f))  # This fails.

The last line fails with

---------------------------------------------------------------------------
RecursionError                            Traceback (most recent call last)
<ipython-input-23-25cc307e6227> in <module>
----> 1 cloudpickle.loads(cloudpickle.dumps(f))

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in _load(self)
     42   def _load(self):
     43     """Import the target module and insert it into the parent's namespace."""
---> 44     module = _importlib.import_module(self.__name__)
     45     self._parent_module_globals[self._local_name] = module
     46     self.__dict__.update(module.__dict__)

... last 2 frames repeated, from the frame below ...

~/anaconda3/lib/python3.6/site-packages/tensorflow/__init__.py in __getattr__(self, item)
     48 
     49   def __getattr__(self, item):
---> 50     module = self._load()
     51     return getattr(module, item)
     52 

RecursionError: maximum recursion depth exceeded while calling a Python object

Interestingly, this succeeds with tensorflow==1.14.0, but I imagine keras has changed a ton in 2.0.

Workaround

As a workaround, you can try defining f in a separate module or Python file like

# helper_file.py

import tensorflow as tf

def f():
    tf.keras.Sequential

And then use it in your main script as follows.

import helper_file
import ray

ray.init(num_cpus=1)

@ray.remote
def use_f():
    helper_file.f()

ray.get(use_f.remote())

The difference here is that when cloudpickle tries to serialize use_f, it won't actually look at the contents of helper_file. When some worker process tries to deserialize use_f, that worker process will import helper_file. This extra indirection seems to make cloudpickle work more reliably. This is the same thing that happens when you pickle a function that uses tensorflow or any other library: cloudpickle doesn't serialize the whole library, it just tells the deserializing process to import the relevant library.
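
As a minimal sketch of that pickling-by-reference behavior (using the standard-library math module as a stand-in for any installed library):

import cloudpickle
import math  # stands in for any installed library, e.g. tensorflow

def g(x):
    # `math` is a module-level global, so cloudpickle stores a reference
    # to the module by name rather than serializing its contents
    return math.sqrt(x)

restored = cloudpickle.loads(cloudpickle.dumps(g))
print(restored(16.0))  # 4.0 -- the deserializing process simply re-imports `math`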

Note: For this to work on multiple machines, helper_file.py must exist and be on the Python path on each machine (one way to accomplish this is by installing it as a Python module on each machine).

I verified that this seems to address the issue in your example. After making that fix, I ran into

  File "<ipython-input-4-bb51dc74442c>", line 3, in play
  File "/Users/rkn/Workspace/ray/helper_file.py", line 15, in play
    action = actor.predict_classes(obs).flatten()[0]
AttributeError: 'NoneType' object has no attribute 'predict_classes'

but that looks like a separate issue.
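
That AttributeError most likely comes from actor = actor.set_weights(weights) in the original function: Keras's set_weights updates the model in place and returns None, so rebinding actor to its return value leaves actor as None. A minimal sketch of that fix:

actor = tf.keras.Sequential([
    tf.keras.layers.Dense(64, input_shape=(24,), activation='relu'),
    tf.keras.layers.Dense(4, activation='softmax')
    ])
actor.set_weights(weights)  # modifies the model in place and returns None; do not rebind actor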

Upvotes: 1
