Ishan Khare

Reputation: 1767

Erlang monitor multiple processes

I need to monitor a bunch of worker processes. Currently I'm able to monitor 1 process through 1 monitor. How do I scale this to monitoring N worker processes? Do I need to spawn N monitors as well? If so, what happens if one of those spawned monitors fails/crashes?

Upvotes: 1

Views: 1001

Answers (2)

7stud

Reputation: 48589

Do I need to spawn N monitors as well?

No:

-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [  % { {Pid, Ref}, Id }
        { spawn_monitor(?MODULE, worker, [Id]), Id }
        || Id <- lists:seq(1, N)
    ],
    monitor_workers(Workers).

%% One process can hold monitors on any number of workers; the
%% Ref and Pid in each 'DOWN' message identify which worker died.
monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            Worker = {Pid, Ref},
            case is_my_worker(Worker, Workers) of
                true  ->  
                    NewWorkers = replace_worker(Worker, Workers, Why),
                    io:format("Old Workers:~n~p~n", [Workers]),
                    io:format("New Workers:~n~p~n", [NewWorkers]),
                    monitor_workers(NewWorkers);
                false -> 
                    monitor_workers(Workers)
            end;
        _Other -> 
            monitor_workers(Workers)
    end.
    
is_my_worker(Worker, Workers) ->
    lists:keymember(Worker, 1, Workers).

replace_worker(Worker, Workers, Why) ->
    {{Pid, _}, Id} = lists:keyfind(Worker, 1, Workers),
    io:format("Worker~w (~w) went down: ~s~n", [Id, Pid, Why]),
    NewWorkers = lists:keydelete(Worker, 1, Workers),
    NewWorker = spawn_monitor(?MODULE, worker, [Id]),
    [{NewWorker, Id}|NewWorkers].

start() ->
    observer:start(),  %% In the Processes tab, you can right-click on a worker and kill it.
    create_workers(4).

In the shell:

$ ./run
Erlang/OTP 19 [erts-8.2] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V8.2  (abort with ^G)


1> Worker3: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker2: I'm still alive
Worker3: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3 (<0.87.0>) went down: killed
Old Workers:
[{{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.87.0>,#Ref<0.0.4.294>},3},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
Worker2: I'm still alive
Worker1: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker1: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker4 (<0.88.0>) went down: killed
Old Workers:
[{{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2},
 {{<0.88.0>,#Ref<0.0.4.295>},4}]
New Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker3: I'm still alive
Worker2: I'm still alive
Worker4: I'm still alive
Worker1: I'm still alive
Worker3: I'm still alive
Worker3: I'm still alive
Worker2: I'm still alive
Worker1 (<0.85.0>) went down: killed
Old Workers:
[{{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.85.0>,#Ref<0.0.4.292>},1},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
New Workers:
[{{<0.5710.0>,#Ref<0.0.1.10430>},1},
 {{<0.5322.0>,#Ref<0.0.1.9248>},4},
 {{<0.2386.0>,#Ref<0.0.1.416>},3},
 {{<0.86.0>,#Ref<0.0.4.293>},2}]
Worker2: I'm still alive
Worker3: I'm still alive
Worker4: I'm still alive
Worker3: I'm still alive

I think the version below is probably more efficient: it uses lists:map() to both search for and replace the crashed worker, so it only traverses the workers list once:

-module(mo).
-compile(export_all).

worker(Id) ->
    timer:sleep(1000 * rand:uniform(5)),
    io:format("Worker~w: I'm still alive~n", [Id]),
    worker(Id).

create_workers(N) ->
    Workers = [  % { {Pid, Ref}, Id }
        { spawn_monitor(?MODULE, worker, [Id]), Id }
        || Id <- lists:seq(1,N)
    ],
    monitor_workers(Workers).

monitor_workers(Workers) ->
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            CrashedWorker = {Pid, Ref},
            NewWorkers = replace(CrashedWorker, Workers, Why),
            io:format("Old Workers:~n~p~n", [Workers]),
            io:format("New Workers:~n~p~n", [NewWorkers]),
            monitor_workers(NewWorkers);
        _Other -> 
            monitor_workers(Workers)
    end.

replace(CrashedWorker, Workers, Why) ->
    lists:map(fun(PidRefId) ->
                      { {Pid,_Ref}=Worker, Id} = PidRefId,
                      case Worker =:= CrashedWorker of
                          true ->  %replace worker
                              io:format("Worker~w (~w) went down: ~s~n", 
                                        [Id, Pid, Why]),
                              {spawn_monitor(?MODULE, worker, [Id]), Id}; %=> { {Pid,Ref}, Id }
                          false ->  %leave worker alone
                              PidRefId  
                      end
              end,
              Workers).

start() ->
    observer:start(),  %% In the Processes tab, you can right-click on a worker and kill it.
    create_workers(4).

If so, what happens if one of those spawned monitors fails/crashes?

Erlang owns several server farms in different countries, and Erlang has acquired several redundant power grids, so Erlang will restart everything in a fault-tolerant, distributed system that will never fail. It's all built in. You don't have to worry about anything. :)

Actually... anywhere you can imagine something failing, it has to be backed up, e.g. by another monitoring process on another computer.
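
As a sketch of that idea: a watchdog function added to the mo module above could monitor the monitor itself and restart it. On a real system the watchdog would run on another machine/node; this is just illustrative:

%% Hypothetical addition to the mo module: a watchdog that monitors
%% the monitor. Note that restarting create_workers/1 also spawns
%% fresh workers while the old ones keep running unmonitored, so
%% real code would need to clean them up or re-adopt them.
watchdog() ->
    {Pid, Ref} = spawn_monitor(?MODULE, create_workers, [4]),
    receive
        {'DOWN', Ref, process, Pid, Why} ->
            io:format("Monitor ~w went down: ~w, restarting~n", [Pid, Why]),
            watchdog()
    end.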

Upvotes: 3

rorra

Reputation: 9693

Do not spawn a process and then monitor it; that used to cause issues in production. Instead, use spawn_monitor, which does both atomically.
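
For instance, a rough sketch of the difference (do_work/0 is a placeholder here):

%% Two separate steps: the worker can exit in the window between
%% spawn and monitor, in which case the 'DOWN' message reports
%% noproc instead of the real exit reason.
Pid = spawn(fun() -> do_work() end),
Ref = erlang:monitor(process, Pid),

%% One atomic step: no such window.
{Pid2, Ref2} = spawn_monitor(fun() -> do_work() end).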

You can start and monitor multiple processes from your supervisor process. If you check the documentation on monitor, you will notice that every time a monitored process dies, a message like:

{'DOWN', MonitorRef, Type, Object, Info}

is sent to the supervisor process that is monitoring the process that just died.

Then you can decide what to do. MonitorRef is the reference you got when you started monitoring the process, and Object contains the Pid of the process that died (or its registered name, if you gave it one).
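
For instance, a minimal sketch (spawning a process that exits immediately with reason boom):

{Pid, Ref} = spawn_monitor(fun() -> exit(boom) end),
receive
    {'DOWN', Ref, process, Pid, Info} ->
        %% Type is process, Object is the Pid, and Info is the
        %% exit reason, here boom.
        io:format("~w went down: ~w~n", [Pid, Info])
end.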

It is a nice exercise to write some sample code using monitor, but for real systems try to stick to the OTP library and OTP supervisors instead.
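
For comparison, a rough sketch of the OTP route (my_worker is a hypothetical gen_server module with a start_link/1):

-module(my_sup).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    %% one_for_one: only the crashed child is restarted.
    SupFlags = #{strategy => one_for_one, intensity => 5, period => 10},
    Children = [#{id => {my_worker, Id},
                  start => {my_worker, start_link, [Id]},
                  restart => permanent}
                || Id <- lists:seq(1, 4)],
    {ok, {SupFlags, Children}}.

The supervisor then does the same restart bookkeeping that the hand-rolled monitor_workers loop does in the other answer.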

Upvotes: 0
