Reputation: 21
I have a question. Suppose I have two pair RDDs:
RDD1 = RDD[(1, 1), (1, 2)]
RDD2 = RDD[(1, obj)] // obj is a mutable Scala object
RDD1.join(RDD2)
This join operation should produce: RDD[(1, (1, obj1)), (1, (2, obj2))]
The question is: are obj1 and obj2 references to the same object? If they are, what happens during this join? I used to think they were two separate objects, each deserialized from obj's serialized form, but today I found that operations on obj1 were reflected in obj2, and I suddenly got confused.
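Here is a minimal sketch of the kind of check I mean (assuming a local SparkContext, e.g. in spark-shell; Counter is a hypothetical stand-in for obj):

import org.apache.spark.{SparkConf, SparkContext}

// Counter is a hypothetical stand-in for the mutable `obj` above.
class Counter(var value: Int) extends Serializable

val sc = new SparkContext(new SparkConf().setAppName("join-identity").setMaster("local[*]"))
val obj = new Counter(0)

val rdd1 = sc.parallelize(Seq((1, 1), (1, 2)))
val rdd2 = sc.parallelize(Seq((1, obj)))

// Joined values: Array[(Int, Counter)], one Counter per output row.
val Array((_, obj1), (_, obj2)) = rdd1.join(rdd2).values.collect()

// Reference identity, not equality: may print true or false depending on
// Spark internals (serialization, partitioning, local vs. cluster mode).
println(obj1 eq obj2)

// Mutation through one reference may or may not show through the other.
obj1.value = 42
println(obj2.value)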
Thanks
Upvotes: 1
Views: 87
Reputation: 2944
You can never know for sure whether these will be the same object or not; it depends on Spark's internal implementation.
If you need to process any kind of mutable state, you should store the objects outside Spark (in Hive, HDFS, ...) and only reference them by id during processing, as sketched below. Even that is not entirely simple, because the same object might then be accessed from multiple threads/executors, and you need to handle that correctly.
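A minimal sketch of that "reference by id" pattern, assuming a hypothetical ObjectStore client and MyState type; any external system that can load/save by id would play the store's role:

import org.apache.spark.rdd.RDD

// Hypothetical state and store client standing in for Hive, HDFS, a
// key-value DB, etc.
case class MyState(count: Int)
trait ObjectStore {
  def load(id: Long): MyState
  def save(id: Long, state: MyState): Unit
}

// The RDD carries only ids; each partition opens its own store client on the
// executor and reads/writes the shared state through it.
def process(ids: RDD[Long], makeStore: () => ObjectStore): RDD[(Long, MyState)] =
  ids.mapPartitions { part =>
    val store = makeStore()
    part.map { id =>
      val updated = MyState(store.load(id).count + 1)
      store.save(id, updated) // concurrent access from other executors must still be coordinated
      (id, updated)
    }
  }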
Generally, as you've been told in the comments, using mutable objects in distributed computation is very fragile: it breaks easily, and the resulting bugs are hard to track down.
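For comparison, a sketch of the immutable alternative (assuming a SparkContext sc is already in scope, as in spark-shell): derive new values with transformations instead of mutating in place, so shared references never matter.

// Acc is a hypothetical immutable replacement for the mutable obj.
case class Acc(total: Int)

val left  = sc.parallelize(Seq((1, 1), (1, 2)))
val right = sc.parallelize(Seq((1, Acc(0))))

// copy() produces a new value; nothing is mutated, so it is irrelevant
// whether the two joined rows ever shared a reference.
val updated = left.join(right).mapValues { case (n, acc) => acc.copy(total = acc.total + n) }
updated.collect().foreach(println) // (1,Acc(1)) and (1,Acc(2)) as independent values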
Upvotes: 2