Reputation: 497
I have been trying to merge the two RDDs below, averagePoints1 and kpoints2. It keeps throwing this error:
ValueError: Can not deserialize RDD with different number of items in pair: (2, 1)
I have tried many things but I can't get it to work, even though the two RDDs are identical and have the same number of partitions. My next step is to apply a Euclidean distance function to the two lists to measure the difference, so if anyone knows how to solve this error, or has a different approach I could follow, I would really appreciate it.
Thanks in advance
averagePoints1 = averagePoints.map(lambda x: x[1])
averagePoints1.collect()
Out[15]:
[[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
kpoints2 = sc.parallelize(kpoints,4)
In [17]:
kpoints2.collect()
Out[17]:
[[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
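For context, the Euclidean distance function referred to above could be as simple as the minimal sketch below (the name euclidean_distance is only illustrative, not from the original code):
import math

# Straight-line distance between two [lat, lon] points.
def euclidean_distance(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))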
Upvotes: 3
Views: 6043
Reputation: 497
For future searchers, this is the solution I followed in the end:
newSample = newCenters.collect()        # the new centers as a list
samples = list(zip(newSample, sample))  # sample => the old centers
samples1 = sc.parallelize(samples)
totalDistance = samples1.map(lambda pair: distanceSquared(pair[0][1], pair[1]))
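The snippet above relies on a distanceSquared helper and on newCenters / sample defined elsewhere in the asker's code. A self-contained sketch of the same idea, with a hypothetical distanceSquared and stand-in data (both are assumptions, not part of the original answer), treating each center as a plain [lat, lon] point:
# Hypothetical squared-distance helper between two [lat, lon] points.
def distanceSquared(p, q):
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Stand-in values for `sample` (old centers) and `newCenters.collect()`.
old_centers = [[34.08, -117.96], [38.80, -120.99]]
new_centers = [[34.48, -118.17], [41.02, -120.46]]

# Pair old and new centers on the driver, distribute the pairs,
# then add up the squared distances to get the total center movement.
pairs = sc.parallelize(list(zip(new_centers, old_centers)))
totalDistance = pairs.map(lambda pair: distanceSquared(pair[0], pair[1])).sum()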
Upvotes: 0
Reputation: 454
a= [[34.48939954847243, -118.17286894440112],
[41.028994230117945, -120.46279399895184],
[37.41157578999635, -121.60431843383599],
[34.42627845075509, -113.87191272382309],
[39.00897622397381, -122.63680410846844]]
b= [[34.0830381107, -117.960562808],
[38.8057258629, -120.990763316],
[38.0822414157, -121.956922473],
[33.4516748053, -116.592291648],
[38.1808762414, -122.246825578]]
rdda = sc.parallelize(a)
rddb = sc.parallelize(b)
c = rdda.zip(rddb)
print(c.collect())
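Since the question's goal is to measure the difference between the paired points, the zipped RDD c can be mapped straight to distances. A sketch of that follow-up step (not part of the linked answer):
import math

# Each element of c is a (point_a, point_b) pair of [lat, lon] lists.
distances = c.map(lambda pair: math.sqrt(
    (pair[0][0] - pair[1][0]) ** 2 + (pair[0][1] - pair[1][1]) ** 2))
print(distances.collect())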
Check this answer: Combine two RDDs in pyspark
Upvotes: 3