Vion

Reputation: 569

Apache Storm: Mutable Object emitted to different bolts

We have been using Storm in our project for a few weeks now. Today, we discovered some very strange behavior. Let's assume we have the following topology:


SpoutA ---> BoltB
       ---> BoltC 

So we have a SpoutA that emits a custom Java object (let's call it Message) to two different bolts, BoltB and BoltC. Basically, we perform a split.

Until today, we assumed that if SpoutA emits the Message object, it gets serialized on SpoutA and deserialized on BoltB as well as on BoltC. However, this assumption seems to be wrong. Today, we discovered that the deserialized object in BoltB is identical (same System.identityHashCode) to the object in BoltC. In other words, if I manipulate the object in BoltB, I also manipulate the object in BoltC, leading to many unforeseen side effects.
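For illustration, here is roughly how we observe this (a simplified sketch, not our real code: the field name "message" and the bolt class are placeholders, SpoutA stands for our spout that emits the Message under that field, and the package names assume org.apache.storm; older releases use backtype.storm):

import java.util.Map;

import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Simplified stand-in for BoltB/BoltC: each instance just logs the identity hash
// of the received Message. With a single worker, both bolts print the same
// number; with three workers, the numbers differ as we would expect.
public class IdentityLoggingBolt extends BaseBasicBolt {

    private String componentId;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        this.componentId = context.getThisComponentId(); // "boltB" or "boltC"
    }

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        Object message = input.getValueByField("message");
        System.out.println(componentId + " got Message with identityHashCode "
                + System.identityHashCode(message));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // terminal bolt, nothing to declare
    }
}

// Wiring of the split, inside the code that builds the topology:
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("spoutA", new SpoutA());
builder.setBolt("boltB", new IdentityLoggingBolt()).shuffleGrouping("spoutA");
builder.setBolt("boltC", new IdentityLoggingBolt()).shuffleGrouping("spoutA");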

In addition, this behavior seems very strange to me since it only occurs if SpoutA and the corresponding bolts B and C are running in the same worker. If I explicitly force the topology to use three workers, then the object is (as expected) a different object for BoltB and for BoltC, since it lives in different JVMs. Hence, if we had a larger topology (50 different bolts) running on three workers, we could never be sure whether objects are currently shared between bolts or not.

So basically, we really do NOT want an object to be shared between bolts. We would expect deserialization to create a new, distinct object for each bolt.

So here is my question: what are our major flaws here? Is it that we emit "mutable" objects? Are we using serialization/deserialization incorrectly? Or could it even be a design flaw in Storm?

Obviously, we could force the creation of new objects by just emitting byte arrays, but in my opinion this contradicts the spirit of Storm.
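For completeness, the byte-array workaround we mean would look roughly like the hypothetical helper below (plain Java serialization, assuming Message implements Serializable); this is exactly the kind of boilerplate we would like to avoid:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;

// Serialize the Message before emitting and deserialize it again in every bolt,
// so each bolt is guaranteed to work on its own copy.
public final class MessageCodec {

    public static byte[] toBytes(Object message) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(message);
            oos.flush();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException("Failed to serialize message", e);
        }
    }

    public static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException("Failed to deserialize message", e);
        }
    }

    private MessageCodec() { }
}

// In the spout:  collector.emit(new Values(MessageCodec.toBytes(message)));
// In each bolt:  Message copy = (Message) MessageCodec.fromBytes(input.getBinaryByField("message"));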

Best regards, André

Upvotes: 2

Views: 573

Answers (2)

Chris Gerken

Reputation: 16392

Storm uses two different queueing approaches when moving tuples from one component to another: one where the two components are inside the same JVM and one where the tuple has to travel across JVMs. I think you're getting caught by the same-JVM case, where objects in tuples aren't actually serialized, since serialization is only required for cross-JVM queues.

I always marshal and unmarshal the data between the tuple and a Java bean to provide a strongly typed interface for my business logic in each bolt/spout. In doing so, I suppose I inadvertently avoid the problem you're hitting. That might be one way to get around it.
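Roughly what I mean, as a sketch (the Message bean with its id/body fields is just an example of mine, not from your question):

import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// The upstream component emits plain fields instead of the bean itself:
//   declarer.declare(new Fields("id", "body"));
//   collector.emit(new Values(message.getId(), message.getBody()));
public class MarshallingBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Unmarshal: rebuild a private Message from the tuple's primitive fields,
        // so no object reference from the upstream component can leak in.
        Message message = new Message(input.getStringByField("id"),
                                      input.getStringByField("body"));

        message.setBody(message.getBody().toUpperCase()); // safe local mutation

        // Marshal again before passing the data downstream.
        collector.emit(new Values(message.getId(), message.getBody()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "body"));
    }
}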

Upvotes: 1

Alma Alma

Reputation: 1722

Why do you expect the hash code to be different? Just as there is no requirement that a user-provided hashCode return different values for two new object instances (with the same fields and field values!), there is nothing that requires the native hashCode implementation to return different values when the same object is created twice.

So to answer your question: What are our major flaws here?

The major flaw is in your understanding of how hashCode works. As Mahesh Madushanka pointed out in the comments, you can work around this behavior.

Also, when you serialize an object, it serializes everything, private fields included. Many Java objects cache their hash value in a private field, e.g. String. So it is completely normal that their hashCode returns the same value (I know that you use System.identityHashCode, and String returns an overridden value from hashCode, but this is still important to remember).
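A quick illustration of the difference (plain Java, nothing Storm-specific):

public class HashCodeDemo {
    public static void main(String[] args) {
        String a = new String("storm");
        String b = new String("storm");

        // Overridden, value-based hashCode: equal for equal contents
        // (and cached in String's private hash field after the first call).
        System.out.println(a.hashCode() == b.hashCode());           // true

        // The identity hash is tied to the instance itself, so two distinct
        // objects will normally (though not guaranteed) report different values.
        System.out.println(System.identityHashCode(a)
                == System.identityHashCode(b));                     // almost always false
    }
}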

Upvotes: 0
