hunter7z

Reputation: 21

Join operation on RDDs with mutable objects

I have a question. Suppose I have two pair RDDs:

RDD1 = RDD[(1, 1), (1, 2)]
RDD2 = RDD[(1, obj)]    // obj is a mutable Scala object

The RDD1.join(RDD2) operation should yield: RDD[(1, (1, obj1)), (1, (2, obj2))]
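For reference, the semantics of this pair join can be sketched in plain Scala without Spark (using a StringBuilder as a stand-in for the mutable object):

```scala
// Plain-Scala sketch of pair-RDD join semantics (no Spark involved):
// for each matching key, emit the cross product of values from both sides.
def join[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, (A, B))] =
  for {
    (k, a)  <- left
    (k2, b) <- right
    if k == k2
  } yield (k, (a, b))

val rdd1 = Seq((1, 1), (1, 2))
val obj  = new StringBuilder("obj")   // some mutable object
val rdd2 = Seq((1, obj))

val joined = join(rdd1, rdd2)
// joined == Seq((1, (1, obj)), (1, (2, obj)))
// In this single-JVM sketch both output tuples hold the SAME reference
// to obj; whether Spark preserves that depends on its internals.
assert(joined.map { case (_, (_, o)) => o }.forall(_ eq obj))
```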

The question is: are obj1 and obj2 references to the same object? If they are, what happens during the join?

I used to think they were two separate objects deserialized from obj's serialized form, but today I found that operations on obj1 were reflected in obj2, and I was suddenly confused.

Thanks

Upvotes: 1

Views: 87

Answers (1)

Vladislav Varslavans

Reputation: 2944

You can never know for sure whether these will be the same object; it depends on Spark's internal implementation. If both output records are produced in the same JVM without a serialization round-trip, they can share a reference; once a record crosses a shuffle boundary and is serialized, the deserialized results are distinct copies.
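The two situations can be illustrated without Spark, using plain Java serialization (roughly what happens when a record is shipped across a shuffle boundary):

```scala
import java.io._

// A small mutable class; Serializable so it could cross a shuffle boundary.
class Counter(var value: Int) extends Serializable

// Serialize an object once, then deserialize it -- roughly what happens
// each time Spark ships a record to a downstream task.
def roundTrip[T](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out   = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}

val obj = new Counter(0)

// Case 1: no serialization (same JVM, same partition): both names refer
// to the same object, so mutation through one is visible through the other.
val obj1 = obj
val obj2 = obj
obj1.value = 42
assert(obj2.value == 42)   // shared reference

// Case 2: serialization round-trip: each deserialization yields a copy.
val copy1 = roundTrip(obj)
val copy2 = roundTrip(obj)
copy1.value = 99
assert(copy2.value == 42)  // independent copy, unaffected by copy1
```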

If you need any kind of mutable state processed, you need to store the objects externally (in Hive, HDFS, ...) and reference them only by id during processing. This is not simple either, because the same object might then be accessed from multiple threads/executors, and you need to handle that correctly.
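A minimal sketch of this reference-by-id pattern, using an in-memory TrieMap as a hypothetical stand-in for the external store (Hive/HDFS/a database in a real job; ObjectStore and Record are made-up names):

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical stand-in for external storage; TrieMap is thread-safe
// for individual put/get operations.
object ObjectStore {
  private val store = TrieMap.empty[Long, List[String]]
  def put(id: Long, v: List[String]): Unit = store.put(id, v)
  def get(id: Long): List[String] = store.getOrElse(id, Nil)
}

// Records shipped through Spark carry only the id, never the mutable object.
case class Record(key: Int, objectId: Long)

ObjectStore.put(1L, List("created"))
val records = Seq(Record(1, 1L), Record(2, 1L))

// A task resolves the id against the store instead of mutating a
// deserialized copy that would be thrown away when the task finishes.
// NOTE: a real job needs atomic read-modify-write per id (compare-and-swap
// or an external transaction); this naive get-then-put is not safe under
// concurrent access from multiple executors.
records.foreach { r =>
  val old = ObjectStore.get(r.objectId)
  ObjectStore.put(r.objectId, s"seen-key-${r.key}" :: old)
}

assert(ObjectStore.get(1L).length == 3)
```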

Generally, as you've been told in the comments, using mutable objects in distributed computation is very fragile: it is easy to break things and hard to track the problem down later.

Upvotes: 2
