Reputation: 335
How do I optimize this code and make it faster? Can the subtraction be performed in Spark's distributed space? Here the RDD is a collection of dictionaries:
import math

all_actors = ["brad", "tom", "abc", "def"]
init_actors = ["tom", "abc"]

for i in all_actors:
    # pull a single row per actor back to the driver
    d1 = bj.filter(lambda x: x['actor'] == i).first()
    for j in init_actors:
        d2 = centroids.filter(lambda x: x['actor'] == j).first()
        # squared difference of every feature key except 'actor'
        dc = {key: (d1[key] - d2[key]) ** 2 for key in d1.keys() if key != 'actor'}
        val = sum(dc.values())
        val = math.sqrt(val)
rdd.take(2)

[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]
This RDD has around 30,000+ keys in each dictionary; the above is just a sample.
Expected output:
The Euclidean distance between each row in the RDD and each centroid.
Upvotes: 0
Views: 1339
Reputation: 434
I understand that you need all distances between the elements from all_actors and all the elements from init_actors.
I think you should do a cartesian product and then map over it to compute all the distances.
all_actors = ["brad", "tom", "abc", "def"]
init_actors = ["tom", "abc"]

# Create the cartesian product of the two filtered RDDs
d1 = bj.filter(lambda x: x['actor'] in all_actors)
d2 = centroids.filter(lambda x: x['actor'] in init_actors)
combinations = d1.cartesian(d2)
Then you just apply a map function that calculates the distance (I am not sure what layout the cartesian result has, so you may have to adjust how calculate_euclidean looks).
combinations.map(calculate_euclidean)
Edit: I googled it and cartesian produces rows of pairs (x, y), where x and y are the same kind of elements as in d1 and d2. The map function receives each pair as a single tuple, so you can just create a function like this:
import math

def calculate_euclidean(pair):
    # cartesian() yields (x, y) tuples, so unpack the pair first
    x, y = pair
    dc = {key: (x[key] - y[key]) ** 2 for key in x.keys() if key != 'actor'}
    val = math.sqrt(sum(dc.values()))
    # returning a dict, but you can change the result row layout if you want
    return {'value': val,
            'actor1': x['actor'],
            'actor2': y['actor']}
All the distance calculations are distributed, so it should run much, much faster.
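For reference, here is a minimal end-to-end sketch of how the pieces could fit together. It assumes a local SparkContext and two tiny hand-made RDDs standing in for bj and centroids (the values are made up for illustration only):

import math
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Dummy rows standing in for the real 30,000+ key dictionaries (made-up values)
bj = sc.parallelize([
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom',  'good': 0, 'bad': 1, 'average': 1},
])
centroids = sc.parallelize([
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
    {'actor': 'abc', 'good': 1, 'bad': 1, 'average': 0},
])

all_actors = ["brad", "tom", "abc", "def"]
init_actors = ["tom", "abc"]

def calculate_euclidean(pair):
    # unpack the (row, centroid) pair produced by cartesian()
    x, y = pair
    squared = {k: (x[k] - y[k]) ** 2 for k in x.keys() if k != 'actor'}
    return {'value': math.sqrt(sum(squared.values())),
            'actor1': x['actor'],
            'actor2': y['actor']}

combinations = (bj.filter(lambda x: x['actor'] in all_actors)
                  .cartesian(centroids.filter(lambda x: x['actor'] in init_actors)))

distances = combinations.map(calculate_euclidean)
print(distances.collect())

Everything inside map runs on the executors, so only the small list of result dicts comes back to the driver.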
Upvotes: 1