Pramod

Reputation: 129

Storm topology processing slowing down gradually

I have been reading about Apache Storm and have tried a few examples from storm-starter. I have also learned how to tune a topology and how to scale it to perform fast enough to meet the required throughput.

I have created an example topology with acking enabled, and I am able to achieve 3K-5K messages processed per second. It performs really fast for the first 10 to 15 minutes, or roughly the first 1-2 million messages, and then it starts slowing down. On the Storm UI I can see the overall latency going up gradually and never coming back down; after a while the throughput drops to only a few hundred messages per second.

I get exactly the same behavior for every topology I have tried. The simplest one just reads from Kafka using KafkaSpout, sends each message to a transform bolt that parses it, and writes the result back to Kafka using KafkaBolt. The parser is very fast; it takes less than a millisecond per message. I have tried a few options such as increasing/decreasing the parallelism and changing the buffer sizes, but the behavior stays the same. Please help me find the reason for the gradual slowdown in the topology. Here is the config I am using:

1 Nimbus machine (4 CPUs), 24GB RAM
2 Supervisor machines (8 CPUs each, 1 thread per core), 24GB RAM
4-node Kafka cluster running on the above 2 supervisor machines (each topic has 4 partitions)

KafkaSpout(2 parallelism)-->TransformerBolt(8)-->KafkaBolt(2)
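
The layout above is wired roughly like this (a simplified sketch against the storm-kafka API; the ZooKeeper address, topic names, and the TransformerBolt body are placeholders for my actual setup):

// Classes come from storm-core and the storm-kafka module
// (org.apache.storm.kafka.*, org.apache.storm.kafka.bolt.*).
TopologyBuilder builder = new TopologyBuilder();

// Kafka spout reading from the 4-partition input topic ("input-topic" is a placeholder).
BrokerHosts hosts = new ZkHosts("zkhost:2181");
SpoutConfig spoutConfig = new SpoutConfig(hosts, "input-topic", "/kafka-spout", "storm-consumer");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 2);

// Transform bolt: parses each message (takes < 1 ms per tuple).
builder.setBolt("transformer", new TransformerBolt(), 8)
       .shuffleGrouping("kafka-spout");

// Kafka bolt writing the transformed messages back out ("output-topic" is a placeholder).
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "kafkahost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
KafkaBolt<String, String> kafkaBolt = new KafkaBolt<String, String>()
        .withProducerProperties(producerProps)
        .withTopicSelector(new DefaultTopicSelector("output-topic"))
        .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<String, String>());
builder.setBolt("kafka-bolt", kafkaBolt, 2)
       .shuffleGrouping("transformer");

The topology-level settings are: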

topology.executor.receive.buffer.size: 65536
topology.executor.send.buffer.size: 65536
topology.spout.max.batch.size: 65536
topology.transfer.buffer.size: 32
topology.receiver.buffer.size: 8
topology.max.spout.pending: 250
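
These are set on the topology Config before submission; continuing the wiring sketch above, that looks roughly like this (the keys mirror the YAML names):

Config conf = new Config();            // org.apache.storm.Config
conf.put("topology.executor.receive.buffer.size", 65536);
conf.put("topology.executor.send.buffer.size", 65536);
conf.put("topology.spout.max.batch.size", 65536);
conf.put("topology.transfer.buffer.size", 32);
conf.put("topology.receiver.buffer.size", 8);
conf.setMaxSpoutPending(250);          // same as topology.max.spout.pending: 250

// Submit using the builder from the wiring sketch above.
StormSubmitter.submitTopology("kafka-transform-topology", conf, builder.createTopology());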

At the start: [Storm UI screenshots]

After a few minutes: [Storm UI screenshots]

After 45 min (latency started going up): [Storm UI screenshots]

After 80 min (latency keeps going up and reaches ~100 s by the time the topology has processed 8-10 million messages): [Storm UI screenshots]

VisualVM: [screenshot]

Threads: [screenshot]

Upvotes: 2

Views: 1490

Answers (2)

Pramod

Reputation: 129

The issue was due to a GC setting with new-generation parameters: the workers were not using the allocated heap completely, so the internal Storm queues were filling up and running out of memory. The strange thing was that Storm did not throw an out-of-memory error; it just stalled. With the help of VisualVM I was able to trace it down.
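
For reference, the worker JVM options (overall heap plus new-generation sizing) can be set per topology roughly as below; the flag values are illustrative placeholders, not the exact ones I ended up with:

Config conf = new Config();
// Override the worker JVM options for this topology. The heap and new-generation
// sizes here are placeholders; check actual heap usage in VisualVM before picking values.
// (The cluster-wide equivalent is worker.childopts in storm.yaml.)
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
        "-Xms2g -Xmx2g -XX:NewSize=512m -XX:MaxNewSize=512m -XX:+PrintGCDetails");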

Upvotes: 0

SQL.injection

Reputation: 2647

Pay attention to the capacity metric on RT_LEFT_BOLT: it is very close to 1, which explains why your topology is slowing down.

From the Storm documentation:

The Storm UI has also been made significantly more useful. There are new stats "#executed", "execute latency", and "capacity" tracked for all bolts. The "capacity" metric is very useful and tells you what % of the time in the last 10 minutes the bolt spent executing tuples. If this value is close to 1, then the bolt is "at capacity" and is a bottleneck in your topology. The solution to at-capacity bolts is to increase the parallelism of that bolt.

Therefore, your solution is to add more executors (and tasks) to that bolt (RT_LEFT_BOLT). Another thing you can do is reduce the number of executors on RT_RIGHT_BOLT; its capacity indicates you don't need that many executors, and 1 or 2 can probably do the job.
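
As a sketch (the bolt classes and the upstream component id are placeholders, since only the component names are visible in your UI), that parallelism change would look like:

// Give the saturated bolt more executors, plus extra tasks so it can be
// rebalanced even higher later without redeploying.
builder.setBolt("RT_LEFT_BOLT", new LeftBolt(), 8)      // capacity was ~1.0
       .setNumTasks(16)
       .shuffleGrouping("upstream-spout");

// Trim the over-provisioned bolt down to 1-2 executors.
builder.setBolt("RT_RIGHT_BOLT", new RightBolt(), 2)
       .shuffleGrouping("upstream-spout");

You can also change executor counts on a running topology (up to its task count) with the rebalance command, e.g. storm rebalance <topology-name> -e RT_LEFT_BOLT=8.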

Upvotes: 1
