Reputation: 1255
I am using Ray to run an experiment with N trials on a single node (my local PC). Only one trial runs at a time. The trials need different amounts of memory, depending on their set of hyperparameters. When running my experiment, Ray 2.5 sometimes crashes with a ray.exceptions.OutOfMemoryError. The exact error looks like this:
Traceback (most recent call last):
File "/home/.../lib/python3.8/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/.../lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/.../lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/.../lib/python3.8/site-packages/ray/_private/worker.py", line 2542, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xxx, ID: 3509da226e9a51e1d3be7f9855d12771eb10a1eea05bf5512de91f73) where the task (actor ID: 91b7d76827552756b57fc3a301000000, name=ImplicitFunc.__init__, pid=134132, memory used=57.48GB) was running was 62.04GB / 62.63GB (0.990643), which exceeds the memory usage threshold of 0.99. Ray killed this worker (ID: a15d6fa9dd1fd4ada92d64a0197f5d7c5cf5731eb8eac96c38d37ced) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xxx`. To see the logs of the worker, use `ray logs worker-a15d6fa9dd1fd4ada92d64a0197f5d7c5cf5731eb8eac96c38d37ced*out -ip xxx.
Upon receiving the exception, ray stops the whole experiment.
When looking in the errored trial folder, this error is printed:
Failure # 1 (occurred at 2023-06-21_08-16-56)
The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 6b89355032ae9f5fcad2ced601000000
pid: 991990
namespace: 12c53ee2-d178-4697-97a2-d87579e2ef12
ip: xxxxxx
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
To me this sounds like the OS might have killed the worker. That is confusing, because the exception above suggests that Ray's internal memory monitor took action.
The Tuner is called like so:
tune_config = tune.TuneConfig(num_samples=args.num_samples,
                              metric="AUC",
                              mode="max")
run_config = air.RunConfig(name=args.name,
                           storage_path=args.save_dir,
                           verbose=3)
trainable_with_resources = tune.with_resources(train_validate_model, {"cpu": 24, "gpu": 1})
tuner = tune.Tuner(trainable_with_resources,
                   param_space=config,
                   tune_config=tune_config,
                   run_config=run_config)
tuner.fit()
Upvotes: 1
Views: 1426
Reputation: 261
There is some information missing to fully answer the question, but I'll try my best:
Ray 1.13 (which I used previously) did not stop the entire experiment, but just stated that the particular trial errored. I'd like that functionality back. Why not try the next trials? :)
Generally this should still be the case - I've tried it locally and confirmed it still works:
from ray import tune

def train(config):
    if config["fail"]:
        raise RuntimeError
    return 5

tune.Tuner(train, param_space={"fail": tune.grid_search([1, 0, 1, 0])}).fit()
Output:
...
╭──────────────────────────────────────────────────────────────────────────╮
│ Trial name         status         fail   iter   total time (s)   _metric │
├──────────────────────────────────────────────────────────────────────────┤
│ train_61886_00001  TERMINATED        0      1      8.29697e-05         5 │
│ train_61886_00003  TERMINATED        0      1      7.98702e-05         5 │
│ train_61886_00000  ERROR             1                                   │
│ train_61886_00002  ERROR             1                                   │
╰──────────────────────────────────────────────────────────────────────────╯
...
The main change since Ray 1.13 is that the "OOM killer" (Ray's memory monitor) is now enabled by default - which is precisely the error you're seeing. However, it should usually only kill the trainer process, not the whole experiment.
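As an aside, the memory monitor can be tuned or turned off via environment variables that must be set before Ray starts. Below is only a minimal sketch, assuming the RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms variables from recent Ray 2.x releases and that you start the local cluster from this driver process:

import os

# Must be set before the local raylet is started, i.e. before ray.init().
os.environ["RAY_memory_usage_threshold"] = "0.95"    # kill workers above this fraction of node memory
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # uncomment to disable the memory monitor entirely

import ray
ray.init()

Disabling the monitor hands the problem back to the OS OOM killer, so raising the threshold (or giving trials more headroom) is usually the safer experiment.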
To see why the whole experiment stops, we would need the log outputs (e.g. stdout/stderr) from the experiment run (not just the trial error).
When I restart my script after the exception, the respective trials seem to run through. It appears that there are still left-over objects in RAM that were not cleaned up before the next trial started. I can't confirm that RAM usage steadily increases, but it seems like there can be left-over objects after a trial. I also had a case where a Ray process (ray::ImplicitFunc) was still running and using 55GB of memory, even though Ray had stopped the whole experiment with the above exception. It appears to me that Ray did not stop all of its processes.
Usually the memory should be cleaned up after a trial finishes (and definitely after the experiment stops!). If the processes are not cleaned up after the experiment stops, that is a bug, and you should report it on GitHub.
If memory keeps increasing for one trial, that also sounds like a bug. Can you try setting reuse_actors=False in your TuneConfig to make sure all trials start a new actor from scratch?
I am confused by the terms "actor", "task" and "worker". What are these in my case exactly? Is it worth it to set max_restarts or max_task_retries?
"Actors" and "Tasks" are Core Ray concepts. An actor is a remote object (and instance of a remote class), and a task is a remote function. These can be executed on any node of a cluster.
"Workers" can mean different things. In your case, the "worker" is a Raylet worker process, which is the process that actually executes your task or actor.
Think about it like this: for each logical CPU in your cluster, one worker process is started (this is worker.py). When you schedule a task or an actor, the Ray scheduler sends it to a free worker process, which then starts executing it.
Unfortunately the term "worker" is quite overloaded: In other contexts, "worker" can refer to distributed training workers - e.g. in data-parallel training.
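Regarding max_restarts and max_task_retries: those are Ray Core fault-tolerance options that you put on your own @ray.remote tasks and actors. With Tune, the trial actor (the ImplicitFunc in your logs) is created for you, so you would not normally set them directly. A minimal, illustrative Ray Core sketch (my_task and MyActor are made-up names):

import ray

ray.init()

# A task is a remote function. max_retries controls how often Ray
# re-executes it if the worker process running it dies.
@ray.remote(max_retries=3)
def my_task(x):
    return x * 2

# An actor is an instance of a remote class. max_restarts controls how often
# the actor is restarted after a crash; max_task_retries controls whether
# pending method calls are retried on the restarted actor.
@ray.remote(max_restarts=2, max_task_retries=2)
class MyActor:
    def ping(self):
        return "pong"

print(ray.get(my_task.remote(21)))   # 42
actor = MyActor.remote()
print(ray.get(actor.ping.remote()))  # pong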
Upvotes: 0