MetallicPriest

Reputation: 30825

groupByKey not working properly in Spark

I have an RDD of key-value pairs like the following.

(Key1, Val1)
(Key1, Val2)
(Key1, Val3)
(Key2, Val4)
(Key2, Val5)

After groupByKey, I expect to get something like this:

Key1, (Val1, Val2, Val3)
Key2, (Val4, Val5)

However, I see that the same keys are repeated even after calling groupByKey(). The total number of key-value pairs is certainly reduced, but there are still many duplicate keys. What could be the problem?

The key type is a Java class whose fields are all integers. Could it be that Spark considers something other than the fields of the objects when identifying them?

Upvotes: 0

Views: 985

Answers (1)

Daniel Darabos

Reputation: 27455

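groupByKey and a lot of other methods in Spark rely on the key's hashCode and equals. A Java class that overrides neither inherits the identity-based versions from Object, so two distinct instances are not equal and will generally return different hash codes. If two instances of your class do not return the same hashCode, Spark will not consider them equal, even if all their fields are equal.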

Make sure you override equals and hashCode!
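
As a rough sketch of what that could look like, here is a key class with two integer fields. The class name PairKey and the field names first and second are made up for illustration, since the question does not show the real class. It also implements Serializable on the assumption that the keys need to be shuffled between executors.

```java
import java.io.Serializable;
import java.util.Objects;

// Hypothetical key class; the name and fields are illustrative only.
final class PairKey implements Serializable {
    private static final long serialVersionUID = 1L;

    private final int first;
    private final int second;

    PairKey(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Two keys are equal when all of their fields are equal.
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof PairKey)) {
            return false;
        }
        PairKey that = (PairKey) other;
        return first == that.first && second == that.second;
    }

    // Derived only from the fields, so equal keys always hash the same,
    // regardless of which JVM created the instance.
    @Override
    public int hashCode() {
        return Objects.hash(first, second);
    }
}
```

With something like this in place, new PairKey(1, 2) created in two different places compares equal and hashes identically, so both records should end up in the same group. Keeping the fields final also avoids the hash changing after the key has been used.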

Upvotes: 2
