Matt

Reputation: 1294

Compare each value of one RDD to each key/value pair of another RDD

This has been bothering me for a while and I am sure I am being very brainless.

I have two RDDs of key/value pairs, corresponding to a name and associated sparse vector:

RDDA = [ (nameA1, sparsevectorA1), (nameA2, sparsevectorA2), (nameA3, sparsevectorA3) ]

RDDB = [ (nameB1, sparsevectorB1), (nameB2, sparsevectorB2) ]

I want the end result to compare each element of the first RDD against each element in the second, producing an RDD of 3 * 2 = 6 elements. In particular, I want the name of the element in the second RDD and the dot product of the two sparsevectors:

RDDC = [ (nameB1, sparsevectorA1.dot(sparsevectorB1)), (nameB2, sparsevectorA1.dot(sparsevectorB2)), 
(nameB1, sparsevectorA2.dot(sparsevectorB1)), (nameB2, sparsevectorA2.dot(sparsevectorB2)), 
(nameB1, sparsevectorA3.dot(sparsevectorB1)), (nameB2, sparsevectorA3.dot(sparsevectorB2)) ]

Is there an appropriate map or inbuilt function to do this?

I assume such an operation must exist, hence my feeling of brainlessness. I can easily, if inelegantly, do this by collecting the two RDDs and writing a for loop, but that is not satisfactory because I want to keep the data in RDD form.

Thanks for your help!

Upvotes: 0

Views: 889

Answers (1)

zero323

Reputation: 330413

Is there an appropriate map or inbuilt function to do this?

Yes, there is, and it is called cartesian:

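def transform(ab):
    # ab is one element of the Cartesian product: ((name_a, vec_a), (name_b, vec_b))
    (_, vec_a), (name_b, vec_b) = ab
    # Keep only the name from the second RDD and the dot product of the two vectors
    return name_b, vec_a.dot(vec_b)

# Pair every element of rddA with every element of rddB, then reduce each pair
rddA.cartesian(rddB).map(transform)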
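To make the shape of the result concrete, here is a minimal local sketch that mimics cartesian followed by map using plain Python lists and itertools.product. It does not use Spark at all: the sample data, the dict-based sparse vectors and the sparse_dot helper are assumptions made purely for illustration, and the element order of a real Spark cartesian may differ.

```python
from itertools import product


def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts (illustrative stand-in)."""
    # Iterate over the smaller dict so the work depends on the sparser vector
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[i] for i, val in u.items() if i in v)


def transform(ab):
    # Same unpacking as in the answer: drop the first name, keep the second
    (_, vec_a), (name_b, vec_b) = ab
    return name_b, sparse_dot(vec_a, vec_b)


# Hypothetical sample data standing in for RDDA and RDDB
rdd_a = [
    ("nameA1", {0: 1.0, 2: 2.0}),
    ("nameA2", {1: 3.0}),
    ("nameA3", {0: 4.0, 1: 1.0}),
]
rdd_b = [
    ("nameB1", {0: 1.0, 1: 1.0}),
    ("nameB2", {2: 5.0}),
]

# product(rdd_a, rdd_b) yields every (a, b) combination: 3 * 2 = 6 pairs
result = [transform(pair) for pair in product(rdd_a, rdd_b)]
```

The result has six (nameB, dot product) tuples, one per combination, which is the layout of RDDC in the question.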

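The problem is that a Cartesian product on a large dataset is usually a really bad idea: it produces one pair for every combination of elements, so the output grows as the product of the two RDD sizes. There is usually a much better approach out there.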
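The answer does not say which alternative it has in mind. One option that may apply, assuming the second RDD is small enough to collect to the driver, is to broadcast it and compute all of its dot products inside a flatMap over the first RDD. The sketch below shows only the per-record function and runs without Spark; DictVector is a hypothetical stand-in for a sparse vector with a dot method, and the Spark calls appear in comments only.

```python
class DictVector:
    """Hypothetical stand-in for a sparse vector, so the sketch runs without Spark."""

    def __init__(self, entries):
        self.entries = dict(entries)

    def dot(self, other):
        # Missing indices count as 0.0
        return sum(v * other.entries.get(i, 0.0) for i, v in self.entries.items())


def dot_with_all(record_a, b_records):
    """Compare one (name, vector) record against every record of the small side."""
    _, vec_a = record_a
    return [(name_b, vec_a.dot(vec_b)) for name_b, vec_b in b_records]


# Hypothetical sample data for a local check
b_records = [
    ("nameB1", DictVector({0: 1.0, 1: 1.0})),
    ("nameB2", DictVector({2: 5.0})),
]
record_a = ("nameA1", DictVector({0: 1.0, 2: 2.0}))
pairs = dot_with_all(record_a, b_records)

# With Spark (not executed here; assumes rddB fits in driver memory):
#   b_local = sc.broadcast(rddB.collect())
#   rddC = rddA.flatMap(lambda rec: dot_with_all(rec, b_local.value))
```

This still computes every pairwise dot product, so it does not reduce the amount of arithmetic; it only avoids building the Cartesian product as a distributed RDD.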

Upvotes: 1
