LunarCA

Reputation: 1

Elixir Supervisor not restarting timed-out Poolboy GenServers after DNS timeout

I'm using Poolboy as a worker pool to make a large number of DNS requests. Some of these DNS queries time out, which raises an error and terminates the GenServer worker:

07:44:29.585 [error] GenServer #PID<0.382.0> terminating
** (Socket.Error) timeout
    (socket 0.3.13) lib/socket/datagram.ex:46: Socket.Datagram.recv!/2
    (dns 2.3.0) lib/dns.ex:76: DNS.query/4
    (dmarc_hijack 0.1.0) lib/dmarc.ex:5: Dmarc.fetch_dmarc_record/1
    (dmarc_hijack 0.1.0) lib/dmarc_hijack/worker.ex:16: DmarcHijack.Worker.handle_call/3
    (stdlib 3.17.1) gen_server.erl:721: :gen_server.try_handle_call/4
    (stdlib 3.17.1) gen_server.erl:750: :gen_server.handle_msg/6
    (stdlib 3.17.1) proc_lib.erl:226: :proc_lib.init_p_do_apply/3
Last message (from #PID<0.717.0>): {:fetch_process_dmarc, "12580.tv"}
State: nil
Client #PID<0.717.0> is dead

Eventually, this leads to all of my Poolboy workers getting killed, and the Supervisor does not appear to restart the Worker GenServers. Application functionality then ceases as there are no more workers, but execution does not halt.

I'm try/catch-ing errors in the Poolboy task as well as the DNS client:

Poolboy task:

  defp setup_task(domain) do
    Task.async(fn ->
      :poolboy.transaction(
        :worker,
        fn pid ->
          try do
            GenServer.call(pid, {:fetch_process_dmarc, domain})
          catch :exit, reason ->
            # Handle timeout
            Logger.warning("Probably just got a timeout on #{domain}. Real reason follows:")
            Logger.warning(inspect(reason))
            {domain, {:error, :timeout}}
          end
        end,
        @timeout
      )
    end)
  end

DNS query code:

defmodule Dmarc do
  def fetch_dmarc_record(domain) do
    try do
      DNS.query("_dmarc.#{domain}", :txt, {select_random_dns_server(), 53})
      |> extract_dmarc_record_from_txt()
    catch error ->
      Logger.error(error)
      {:error, :timeout}
    end
  end

It makes the most sense to me to handle the DNS query timeout at the point where the query is made, but the try/catch block is not handling it. I think this is happening because the recv! call raises on a timeout and bypasses my try/catch block, but I could be wrong here.

Based on my understanding, the supervisor should restart the terminated GenServers, but for whatever reason, once they terminate from the timeout, they are never restarted.

Application config with Supervisor details:

defmodule DmarcHijack.Application do
  use Application

  defp poolboy_config do
    [
      name: {:local, :worker},
      worker_module: DmarcHijack.Worker,
      size: 5,
      max_overflow: 5
    ]
  end

  @impl true
  def start(_type, _args) do
    children = [
      DmarcHijack.ResultsBucket,
      :poolboy.child_spec(:worker, poolboy_config())
    ]

    opts = [strategy: :one_for_one, name: DmarcHijack.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

I'd really appreciate any help available to debug this issue. Thanks!

Upvotes: 0

Views: 131

Answers (1)

LunarCA

Reputation: 1

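For anyone dealing with the same issue, I resolved it by doing the following:

  1. Replaced the catch with rescue for the DNS query. The timeout is a raised exception (Socket.Error), and a bare `catch error ->` clause only matches thrown values, so the exception was never handled and crashed the worker.
  2. Set the timeout value for Poolboy to :infinity (not :infinite), since the timeout is already being handled by the DNS query.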
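
To illustrate the first point, here is a minimal sketch of the difference between catch and rescue. It does not use the real dns or socket packages: FakeSocketError and flaky_query/1 are made-up stand-ins for Socket.Error and DNS.query, so the example runs on its own.

```elixir
defmodule TimeoutDemo do
  # Hypothetical stand-in for Socket.Error, so no dependencies are needed.
  defmodule FakeSocketError do
    defexception message: "timeout"
  end

  # Hypothetical stand-in for a DNS query that always times out by raising.
  defp flaky_query(_domain), do: raise(FakeSocketError)

  # A one-argument catch clause only matches values passed to throw/1.
  # The raised exception is not matched and propagates to the caller.
  def with_catch(domain) do
    try do
      flaky_query(domain)
    catch
      _thrown -> {:error, :timeout}
    end
  end

  # rescue matches raised exceptions, so the timeout becomes a return value.
  def with_rescue(domain) do
    try do
      flaky_query(domain)
    rescue
      _e in FakeSocketError -> {:error, :timeout}
    end
  end
end
```

Calling TimeoutDemo.with_rescue("example.com") returns {:error, :timeout}, while TimeoutDemo.with_catch("example.com") lets FakeSocketError escape, which inside a GenServer callback would terminate the process.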

I'm pretty sure this isn't the best solution, but it worked for me.
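
On the timeout side, it may help to see how a finite call timeout behaves compared with :infinity. The sketch below uses a plain GenServer rather than Poolboy; SlowWorker and CallTimeoutDemo are hypothetical names, and the sleep stands in for a slow DNS lookup. I have not verified that this is exactly what happened in the original application, so treat it as an illustration only.

```elixir
defmodule SlowWorker do
  use GenServer

  # delay_ms simulates how long a lookup takes.
  def start_link(delay_ms), do: GenServer.start_link(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call({:lookup, domain}, _from, delay_ms) do
    Process.sleep(delay_ms)
    {:reply, {domain, :ok}, delay_ms}
  end
end

defmodule CallTimeoutDemo do
  # GenServer.call/3 exits the caller when the timeout elapses;
  # with :infinity the caller waits until the worker replies.
  def lookup(pid, domain, timeout) do
    try do
      GenServer.call(pid, {:lookup, domain}, timeout)
    catch
      :exit, {:timeout, _} -> {domain, {:error, :timeout}}
    end
  end
end
```

With a worker started via SlowWorker.start_link(100), CallTimeoutDemo.lookup(pid, "example.com", 10) gives up on the caller side while the worker is still busy, whereas passing :infinity waits for the reply. Note that GenServer.call/2 defaults to a 5000 ms timeout when none is given.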

Upvotes: 0
