Vion

Reputation: 569

Apache Storm: Mutable Object emitted to different bolts

We have been using Storm in our project for a few weeks now. Today, we discovered some very strange behavior. Let's assume we have the following topology:


SpoutA ---> BoltB
       ---> BoltC 

So we have a SpoutA that emits a custom Java object (let's call it Message) to two different bolts, BoltB and BoltC. Basically, we perform a split.

Until today, we assumed that if SpoutA emits the Message object, it gets serialized on SpoutA and deserialized on BoltB as well as on BoltC. However, this assumption seems to be wrong. Today, we discovered that the deserialized object in BoltB is identical (same System.identityHashCode) to the object in BoltC. In other words, if I manipulate the object in BoltB, I also manipulate the object in BoltC, leading to many unforeseen side effects.
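For illustration, here is roughly how we observe this (a simplified sketch, not our real code: the field name "message" and the bolt class are placeholders, SpoutA stands for our spout that emits the Message under that field, and the package names assume org.apache.storm; older releases use backtype.storm):

import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Simplified stand-in for BoltB/BoltC: each instance just logs the identity hash
// of the received Message. With a single worker, both bolts print the same
// number; with three workers, the numbers differ as we would expect.
public class IdentityLoggingBolt extends BaseBasicBolt {

    private String componentId;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        this.componentId = context.getThisComponentId(); // "boltB" or "boltC"
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        Object message = input.getValueByField("message");
        System.out.println(componentId + " got Message with identityHashCode "
                + System.identityHashCode(message));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt, nothing to declare
    }
}

// Wiring of the split, inside the code that builds the topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spoutA", new SpoutA());
builder.setBolt("boltB", new IdentityLoggingBolt()).shuffleGrouping("spoutA");
builder.setBolt("boltC", new IdentityLoggingBolt()).shuffleGrouping("spoutA");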

In addition, this behavior seems very strange to me since it only occurs if SpoutA and the corresponding bolts B and C are running in the same worker. If I explicitly force the topology to use three workers, then the object is (as expected) a different object for BoltB and for BoltC, since it lives in different JVMs. Hence, if we had a larger topology (50 different bolts) running on three workers, we could never be sure whether objects are currently shared between bolts or not.

So basically, we really do NOT want an object to be shared between bolts. We would expect deserialization to create a new, distinct object for each bolt.

So here is my question: what are our major flaws here? Is it that we emit "mutable" objects? Are we using serialization/deserialization incorrectly? Or could it even be a design flaw in Storm?

Obviously, we could force the creation of new objects by just emitting byte arrays, but in my opinion this contradicts the spirit of Storm.
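For completeness, the byte-array workaround we mean would look roughly like the hypothetical helper below (plain Java serialization, assuming Message implements Serializable); this is exactly the kind of boilerplate we would like to avoid:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Serialize the Message before emitting and deserialize it again in every bolt,
// so each bolt is guaranteed to work on its own copy.
public final class MessageCodec {

    public static byte[] toBytes(Object message) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(message);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException("Failed to serialize message", e);
        }
    }

    public static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("Failed to deserialize message", e);
        }
    }

    private MessageCodec() { }
}

// In the spout:  collector.emit(new Values(MessageCodec.toBytes(message)));
// In each bolt:  Message copy = (Message) MessageCodec.fromBytes(input.getBinaryByField("message"));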

Best regards, André

Upvotes: 2

Views: 573

Answers (2)

Chris Gerken

Reputation: 16392

Storm uses two different queueing approaches when moving tuples from one component to another: one where the two components are inside the same JVM and one where the tuple has to travel across JVMs. I think you're getting caught by the same-JVM case, where objects in tuples aren't actually serialized, since serialization is only required for cross-JVM queues.

I always marshal and unmarshal the data between the tuple and a Java bean to provide a strongly typed interface for my business logic in each bolt/spout. In doing so, I suppose I inadvertently avoid the problem you're hitting. That might be one way to get around it.
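Roughly what I mean, as a sketch (the Message bean with its id/body fields is just an example of mine, not from your question):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// The upstream component emits plain fields instead of the bean itself:
//   declarer.declare(new Fields("id", "body"));
//   collector.emit(new Values(message.getId(), message.getBody()));
public class MarshallingBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Unmarshal: rebuild a private Message from the tuple's primitive fields,
        // so no object reference from the upstream component can leak in.
        Message message = new Message(input.getStringByField("id"),
                                      input.getStringByField("body"));

        message.setBody(message.getBody().toUpperCase()); // safe local mutation

        // Marshal again before passing the data downstream.
        collector.emit(new Values(message.getId(), message.getBody()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "body"));
    }
}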

Upvotes: 1

Alma Alma

Reputation: 1722

Why do you expect the hash code to be different? Just as there is no requirement that a user-provided hashCode return different values for two new object instances (with the same fields and field values!), there is nothing that requires the native hashCode implementation to return different values when the same object is created twice.

So to answer your question: What are our major flaws here?

The major flaw is in your understanding of how hashCode works. As Mahesh Madushanka pointed out in the comments, you can work around this behavior.

Also, when you serialize an object, it serializes everything, private fields included. Many Java objects cache their hash value in a private field, e.g. String. So it is completely normal that their hashCode returns the same value (I know that you use System.identityHashCode, and String returns an overridden value from hashCode, but this is still important to remember).
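A quick illustration of the difference (plain Java, nothing Storm-specific):

public class HashCodeDemo {
    public static void main(String[] args) {
        String a = new String("storm");
        String b = new String("storm");

        // Overridden, value-based hashCode: equal for equal contents
        // (and cached in String's private hash field after the first call).
        System.out.println(a.hashCode() == b.hashCode());           // true

        // The identity hash is tied to the instance itself, so two distinct
        // objects will normally (though not guaranteed) report different values.
        System.out.println(System.identityHashCode(a)
                == System.identityHashCode(b));                     // almost always false
    }
}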

Upvotes: 0
