Apache Spark RDD cache and Lineage confusion

Question

I fill an RDD with some random math values:

val itemFactors = rddItems.mapValues(newFactors => 
    Vector(Array.fill(2){math.random})
)

I then join that RDD to some other RDD and cache it:

val finalRDD = itemFactors.join(rddItemsUsers).map{
    case(itemid, (itemVector, ((userid, rating), userVector))) => 
        (itemid, itemVector, userid, userVector, rating)}.cache

I then perform a calculation on the data held in the finalRDD:

sqrt(finalRDD.aggregate(0.0)((accum, item) => 
    accum + pow(item._5 - item._4.dot(item._2), 2), _ + _) / finalRDD.count)

I call the final part of the code, sqrt(...) repeatedly from the console and every single time I get a different result - which is not desired as I haven't changed anything! This can be remedied (i.e. made so I get a consistent result) in 2 ways:

Instead of initialising itemFactors with math.random, I can fill the Array with a fixed number, e.g. 1.0
I can do itemFactors.cache.

Now, I understand that due to lineage, every time itemFactors is called it will call math.random and create a new number - this will therefore affect my calculation when it's performed. This is why using a fixed number when filling the Array produces consistent result.

But, the big problem and the bit which I don't understand is: I am caching finalRDD which is what the calculation is performed on, and as it comprises of itemFactors, surely it shouldn't matter what itemFactor's Array is filled with as the node is only visited once? I thought I was beginning to get a grasp on the lineage; however, this has just thrown me.

Apache Spark RDD cache and Lineage confusion

Answers (1)

Related Questions