Reputation: 1255
I am using Ray to run an experiment with N trials on a single node (my local PC). Only one trial runs at a time. The trials need different amounts of memory, depending on their set of hyperparameters. When running my experiment, Ray 2.5 sometimes crashes with a ray.exceptions.OutOfMemoryError. The exact error looks like this:
Traceback (most recent call last):
File "/home/.../lib/python3.8/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
result = ray.get(future)
File "/home/.../lib/python3.8/site-packages/ray/_private/auto_init_hook.py", line 18, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/.../lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/.../lib/python3.8/site-packages/ray/_private/worker.py", line 2542, in get
raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: xxx, ID: 3509da226e9a51e1d3be7f9855d12771eb10a1eea05bf5512de91f73) where the task (actor ID: 91b7d76827552756b57fc3a301000000, name=ImplicitFunc.__init__, pid=134132, memory used=57.48GB) was running was 62.04GB / 62.63GB (0.990643), which exceeds the memory usage threshold of 0.99. Ray killed this worker (ID: a15d6fa9dd1fd4ada92d64a0197f5d7c5cf5731eb8eac96c38d37ced) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip xxx`. To see the logs of the worker, use `ray logs worker-a15d6fa9dd1fd4ada92d64a0197f5d7c5cf5731eb8eac96c38d37ced*out -ip xxx.
Upon receiving the exception, ray stops the whole experiment.
When looking in the errored trial folder, this error is printed:
Failure # 1 (occurred at 2023-06-21_08-16-56)
The actor died unexpectedly before finishing this task.
class_name: ImplicitFunc
actor_id: 6b89355032ae9f5fcad2ced601000000
pid: 991990
namespace: 12c53ee2-d178-4697-97a2-d87579e2ef12
ip: xxxxxx
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
To me this sounds like the OS might have killed the worker. That is confusing, because the exception above suggests that Ray's internal memory monitor took action.
The Tuner is called like so:
tune_config = tune.TuneConfig(num_samples=args.num_samples,
                              metric="AUC",
                              mode="max")
run_config = air.RunConfig(name=args.name,
                           storage_path=args.save_dir,
                           verbose=3)
trainable_with_resources = tune.with_resources(train_validate_model, {"cpu": 24, "gpu": 1})
tuner = tune.Tuner(trainable_with_resources,
                   param_space=config,
                   tune_config=tune_config,
                   run_config=run_config)
tuner.fit()
Upvotes: 1
Views: 1426
Reputation: 261
There is some information missing to fully answer the question, but I'll try my best:
Ray 1.13 (which I used previously) did not stop the entire experiment, but just stated that the particular trial errored. I'd like that functionality back. Why not try the next trials? :)
Generally this should still be the case - I've tried it locally and confirmed it still works:
from ray import tune

def train(config):
    if config["fail"]:
        raise RuntimeError
    return 5

tune.Tuner(train, param_space={"fail": tune.grid_search([1, 0, 1, 0])}).fit()
Output:
...
╭──────────────────────────────────────────────────────────────────────────╮
│ Trial name         status         fail   iter   total time (s)   _metric │
├──────────────────────────────────────────────────────────────────────────┤
│ train_61886_00001  TERMINATED        0      1      8.29697e-05         5 │
│ train_61886_00003  TERMINATED        0      1      7.98702e-05         5 │
│ train_61886_00000  ERROR             1                                   │
│ train_61886_00002  ERROR             1                                   │
╰──────────────────────────────────────────────────────────────────────────╯
...
The main change since Ray 1.13 is that the "OOM killer" (Ray's memory monitor) is now enabled by default - which is precisely the error you're seeing. However, it should usually only kill the trainer process, not the whole experiment.
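As an aside, the memory monitor can be tuned or turned off via environment variables that must be set before Ray starts. Below is only a minimal sketch, assuming the RAY_memory_usage_threshold and RAY_memory_monitor_refresh_ms variables from recent Ray 2.x releases and that you start the local cluster from this driver process:

import os

# Must be set before the local raylet is started, i.e. before ray.init().
os.environ["RAY_memory_usage_threshold"] = "0.95"    # kill workers above this fraction of node memory
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"  # uncomment to disable the memory monitor entirely

import ray
ray.init()

Disabling the monitor hands the problem back to the OS OOM killer, so raising the threshold (or giving trials more headroom) is usually the safer experiment.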
To see why the whole experiment stops, we would need the log outputs (e.g. stdout/stderr) from the experiment run (not just the trial error).
When I restart my script after the exception, the respective trials seem to run through. It appears that there are still left-over objects in RAM that were not cleaned up before the next trial started. I can't confirm that RAM usage steadily increases, but it seems like there can be left-over objects after a trial. I also had a case where a Ray process (ray::ImplicitFunc) was still running and using 55GB of memory, even though Ray had stopped the whole experiment with the above exception. It appears to me that Ray did not stop all of its processes.
Usually the memory should be cleaned up after a trial finishes (and definitely after the experiment stops!). If the processes are not cleaned up after the experiment stops, that is a bug, and you should report it on GitHub.
If memory keeps increasing for one trial, that also sounds like a bug. Can you try setting reuse_actors=False in your TuneConfig to make sure all trials start a new actor from scratch?
I am confused by the terms "actor", "task" and "worker". What are these in my case exactly? Is it worth it to set max_restarts or max_task_retries?
"Actors" and "Tasks" are Core Ray concepts. An actor is a remote object (and instance of a remote class), and a task is a remote function. These can be executed on any node of a cluster.
"Workers" can mean different things. In your case, the "worker" is a Raylet worker process, which is the process that actually executes your task or actor.
Think about it like this: for each logical CPU in your cluster, one worker process is started (this is worker.py). When you schedule a task or an actor, the Ray scheduler sends it to a free worker process, which then starts executing it.
Unfortunately the term "worker" is quite overloaded: In other contexts, "worker" can refer to distributed training workers - e.g. in data-parallel training.
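Regarding max_restarts and max_task_retries: those are Ray Core fault-tolerance options that you put on your own @ray.remote tasks and actors. With Tune, the trial actor (the ImplicitFunc in your logs) is created for you, so you would not normally set them directly. A minimal, illustrative Ray Core sketch (my_task and MyActor are made-up names):

import ray

ray.init()

# A task is a remote function. max_retries controls how often Ray
# re-executes it if the worker process running it dies.
@ray.remote(max_retries=3)
def my_task(x):
    return x * 2

# An actor is an instance of a remote class. max_restarts controls how often
# the actor is restarted after a crash; max_task_retries controls whether
# pending method calls are retried on the restarted actor.
@ray.remote(max_restarts=2, max_task_retries=2)
class MyActor:
    def ping(self):
        return "pong"

print(ray.get(my_task.remote(21)))   # 42
actor = MyActor.remote()
print(ray.get(actor.ping.remote()))  # pong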
Upvotes: 0