Reputation: 113
I have defined a function in PySpark to calculate the Euclidean distance between my centroids and a set of points I have.
def dist(x):
    b = {'d1':distance.euclidean((6,8),x),'d2':distance.euclidean((1,2),x),'d3':distance.euclidean((5,5),x)}
    def get_key(val):
        for key, value in b.items():
            if val == value:
                return key
    print(get_key(min(b.values())))
My points are as follows:
data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35), (3.328, 4.944), (3.195, 5.186)]
My objective is to feed all these points into my function and get the nearest centroid for each point. A hypothetical example of the output I am expecting looks like this:
[((3.023, 5.138),d1),
((3.075, 4.989),d1),
((2.321, 5.35),d2),
((3.328, 4.944),d1),
((3.195, 5.186),d3)]
When I feed individual points into this function it works perfectly. However, when I try to do this for multiple points using a lambda function, I get None instead of the centroids.
data.map(lambda x:(x,dist((x)))).take(5)
Out[17]: [((3.023, 5.138), None),
((3.075, 4.989), None),
((2.321, 5.35), None),
((3.328, 4.944), None),
((3.195, 5.186), None)]
What am I doing wrong here? I would appreciate some help.
Upvotes: 0
Views: 1093
Reputation: 54718
Your function dist doesn't return anything. It calls the print function, which returns nothing. Naturally, it prints None. Change the print to return and I suspect you will be happier.
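For illustration, here is a minimal sketch of that change. It makes a few assumptions that are not in the question: math.dist (Python 3.8+) stands in for scipy's distance.euclidean so the snippet has no dependencies, the centroids are moved into a hypothetical CENTROIDS dict, and a plain list comprehension stands in for the RDD map.

```python
import math

# Centroids and labels copied from the question; the CENTROIDS name is hypothetical.
CENTROIDS = {"d1": (6, 8), "d2": (1, 2), "d3": (5, 5)}

def dist(x):
    """Return the label of the centroid closest to point x."""
    # math.dist is used here in place of scipy's distance.euclidean (assumption).
    distances = {label: math.dist(c, x) for label, c in CENTROIDS.items()}
    # Return the label with the smallest distance instead of printing it.
    return min(distances, key=distances.get)

data = [(3.023, 5.138), (3.075, 4.989), (2.321, 5.35), (3.328, 4.944), (3.195, 5.186)]

# Plain-Python stand-in for data.map(lambda x: (x, dist(x))).take(5)
result = [(p, dist(p)) for p in data]
```

Because dist now returns a value, the same lambda from the question should yield a label for each point rather than None. Note that with these particular centroids every sample point is closest to d3, so the expected output in the question really was only hypothetical.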
Upvotes: 2