spaci1010

Reputation: 1

All storm topology workers are shutting down on worker nodes

Let me give a brief overview of my Storm cluster setup:

- total nodes: 12
- total worker nodes: 12
- nimbus nodes: 2
- zookeeper nodes: 3
- pacemaker nodes: 2

- nodes 1 & 12 run storm-nimbus
- nodes 1, 2 & 3 run zookeeper
- all 12 nodes run a supervisor

Memory of each worker node: 350 GB
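
For reference, the layout above maps onto the usual cluster keys (these normally live in storm.yaml on every node; the hostnames, ports and class name below are placeholders, not our real values):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.storm.Config;

public class ClusterLayoutSketch {
    // Illustrative only: the same keys are normally set in storm.yaml on each node.
    public static Map<String, Object> clusterConf() {
        Map<String, Object> conf = new HashMap<>();
        // nimbus runs on nodes 1 and 12 (hostnames are placeholders)
        conf.put(Config.NIMBUS_SEEDS, Arrays.asList("node1", "node12"));
        // zookeeper ensemble on nodes 1, 2 and 3
        conf.put(Config.STORM_ZOOKEEPER_SERVERS, Arrays.asList("node1", "node2", "node3"));
        // every node runs a supervisor; the slot count here is illustrative
        conf.put(Config.SUPERVISOR_SLOTS_PORTS, Arrays.asList(6700, 6701, 6702, 6703));
        return conf;
    }
}
```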

Recent changes:

- upgraded Storm from 1.1.2 to 2.4.0
- upgraded ZooKeeper from 3.4.6 to 3.5.10

We have 2 main topologies running which take the most memory (though they never actually reach their configured maximums):

- the 1st topology is configured to run on all 12 nodes and use at most 350 GB
- the 2nd topology is configured to run on 10 nodes and use at most 70 GB
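
For context, this sizing is expressed with the usual per-topology knobs at submit time (the class name and numbers below are placeholders for illustration, not our exact settings):

```java
import org.apache.storm.Config;

public class TopologySizingSketch {
    // Sketch of how the per-topology worker count and memory cap are typically set.
    public static Config sizing() {
        Config conf = new Config();
        // one worker process per node, i.e. 12 workers for the first topology
        conf.setNumWorkers(12);
        // upper bound (in MB) the scheduler will assign to a single worker JVM;
        // the value is a placeholder, not the real per-worker split of 350 GB
        conf.setTopologyWorkerMaxHeapSize(8192);
        // the heap actually given to each worker JVM comes from the childopts
        conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx8g");
        return conf;
    }
}
```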

Issues being faced post upgrade (they are intermittent):

  1. During topology execution the worker processes suddenly crash and new processes start. That by itself is expected behavior; the actual problem is that all of the topologies on all of the workers crash at the same time. I can see that the supervisor is issuing a force kill with exit codes 137 & 20 to those topologies.

  2. On 5-7 nodes the memory utilization reaches 99+% and sometimes only 400-800 MB of memory is available. (I observed that the ZooKeeper nodes are also peaking in memory usage.)

  3. The supervisor logs show some ZooKeeper timeout errors before the topologies are force-killed.
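
One thing I plan to check (based on my reading, not confirmed): exit code 137 is 128 + 9, i.e. the process was SIGKILLed, which would fit either the Linux OOM killer or the supervisor force-killing a worker whose heartbeats stopped. The small sketch below just prints the effective ZooKeeper and supervisor timeouts from the cluster config so I can compare them against the timestamps in the logs (the class name is a placeholder):

```java
import java.util.Map;

import org.apache.storm.Config;
import org.apache.storm.utils.Utils;

public class TimeoutCheck {
    public static void main(String[] args) {
        // effective cluster config: defaults.yaml overlaid with storm.yaml
        Map<String, Object> conf = Utils.readStormConfig();

        // ZooKeeper session/connection timeouts; long GC pauses or heavy memory
        // pressure that exceed these can make workers look dead to the cluster
        System.out.println(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT + " = "
                + conf.get(Config.STORM_ZOOKEEPER_SESSION_TIMEOUT));
        System.out.println(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT + " = "
                + conf.get(Config.STORM_ZOOKEEPER_CONNECTION_TIMEOUT));

        // how long the supervisor waits for worker heartbeats before force-killing
        System.out.println(Config.SUPERVISOR_WORKER_TIMEOUT_SECS + " = "
                + conf.get(Config.SUPERVISOR_WORKER_TIMEOUT_SECS));
    }
}
```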

The questions are: under what circumstances can the Storm supervisor kill all the topologies? Could nimbus/supervisor being unable to communicate with ZooKeeper be the reason the processes are killed?

Unfortunately I cannot paste the error trace, but with some guidance I can look at all the possible places.

Thanks a lot in advance.

Unfortunately I could not try anything without concrete evidence, as we are facing the issue in production and no lower environment matches the prod infra configuration.

Upvotes: 0

Views: 63

Answers (0)
