Reputation: 1294
This has been bothering me for a while and I am sure I am being very brainless.
I have two RDDs of key/value pairs, corresponding to a name and associated sparse vector:
RDDA = [ (nameA1, sparsevectorA1), (nameA2, sparsevectorA2), (nameA3, sparsevectorA3) ]
RDDB = [ (nameB1, sparsevectorB1), (nameB2, sparsevectorB2) ]
I want the end result to compare each element of the first RDD against each element in the second, producing an RDD of 3 * 2 = 6 elements. In particular, I want the name of the element in the second RDD and the dot product of the two sparsevectors:
RDDC = [ (nameB1, sparsevectorA1.dot(sparsevectorB1)), (nameB2, sparsevectorA1.dot(sparsevectorB2)),
(nameB1, sparsevectorA2.dot(sparsevectorB1)), (nameB2, sparsevectorA2.dot(sparsevectorB2)),
(nameB1, sparsevectorA3.dot(sparsevectorB1)), (nameB2, sparsevectorA3.dot(sparsevectorB2)) ]
Is there an appropriate map or inbuilt function to do this?
I assume such an operation must exist, hence my feeling of brainlessness. I could easily, if inelegantly, do this by collecting the two RDDs and looping over them, but that is not satisfactory because I want to keep them in RDD form.
Thanks for your help!
Upvotes: 0
Views: 889
Reputation: 330413
Is there an appropriate map or inbuilt function to do this?
Yes, there is, and it is called cartesian.
def transform(ab):
    (_, vec_a), (name_b, vec_b) = ab
    return name_b, vec_a.dot(vec_b)

rddA.cartesian(rddB).map(transform)
The problem is that a Cartesian product of large datasets is usually a really bad idea, since the output has one element for every pair, and there is usually a much better approach out there.
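To make concrete what cartesian followed by map computes, here is a small Spark-free sketch in plain Python. It is only an illustration under stated assumptions: itertools.product stands in for cartesian, plain lists stand in for the RDDs, sparse vectors are represented as {index: value} dicts, and sparse_dot is a hypothetical helper standing in for the vectors' own dot method. The sample values are made up.

```python
from itertools import product


def sparse_dot(u, v):
    """Dot product of two sparse vectors stored as {index: value} dicts.

    Hypothetical stand-in for the dot method of a real sparse vector type.
    """
    # Iterate over the smaller dict and look indices up in the larger one.
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v[i] for i, val in u.items() if i in v)


def transform(ab):
    # Same shape as the transform above: drop the name from the first
    # collection, keep the name from the second plus the dot product.
    (_, vec_a), (name_b, vec_b) = ab
    return name_b, sparse_dot(vec_a, vec_b)


# Plain lists standing in for the two RDDs (illustrative data only).
rdd_a = [
    ("nameA1", {0: 1.0, 3: 2.0}),
    ("nameA2", {1: 4.0}),
    ("nameA3", {0: 0.5, 1: 1.0}),
]
rdd_b = [
    ("nameB1", {0: 2.0, 1: 1.0}),
    ("nameB2", {3: 3.0}),
]

# product(rdd_a, rdd_b) yields every (a, b) pair: 3 * 2 = 6 of them.
rdd_c = [transform(pair) for pair in product(rdd_a, rdd_b)]
```

The pair count grows as len(rdd_a) * len(rdd_b), which is why the same operation on two large RDDs gets expensive quickly.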
Upvotes: 1