Optimize Pyspark code to run fast

Question

How can I optimize this code so that it runs faster? Can the subtraction be performed in Spark's distributed space? Here the RDD is a collection of dictionaries.

all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]

import math

for i in all_actors:
    d1 = bj.filter(lambda x: x['actor'] == i).first()
    for j in init_actors:
        d2 = centroids.filter(lambda x: x['actor'] == j).first()
        # squared difference for every key except the 'actor' label
        dc = {key: (d1[key] - d2[key])**2 for key in d1.keys() if key != 'actor'}
        val = sum(dc.values())
        val = math.sqrt(val)

rdd.take(2)

[{'actor': 'brad',
  'good': 1,
  'bad': 0,
  'average': 0},
 {'actor': 'tom',
  'good': 0,
  'bad': 1,
  'average': 1}]

Each dictionary in this RDD has around 30,000+ keys. This is just a sample.

Expected Output:

Find the Euclidean distance between each row in RDD.

Quilir · Accepted Answer

As I understand it, you need the distances between every element of all_actors and every element of init_actors.

I think you should take the cartesian product and then use map to compute all the distances.

all_actors =["brad", "tom", "abc", "def"]
init_actors=["tom", "abc"]

# Create cartesian product of tables
d1=bj.filter(lambda x: x['actor'] in all_actors)
d2=centroids.filter(lambda x: x['actor'] in init_actors)
combinations = d1.cartesian(d2)

Then you just apply a map function that calculates the distance for each pair (I was not sure what layout the cartesian result has, so you may have to adjust how calculate_euclidean reads its input).

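# cartesian yields (x, y) tuples, so unpack each pair into the function's two arguments
distances = combinations.map(lambda pair: calculate_euclidean(*pair))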

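Edit: I looked it up, and cartesian produces rows of pairs (x, y), where x is a row of the first RDD and y is a row of the second, so both are the same kind of dictionary as the rows for all_actors and init_actors. So you can just create a function: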

import math

def calculate_euclidean(x, y):
    # sum the squared differences over every key except the 'actor' label
    squared = sum((x[key] - y[key])**2 for key in x.keys() if key != 'actor')
    val = math.sqrt(squared)

    # returning a dict, but you can change the result row layout if you want
    return {'value': val,
            'actor1': x['actor'],
            'actor2': y['actor']}
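
A quick way to check a distance function like this before running it on the cluster is to call it on plain dictionaries. The sketch below restates the function compactly and uses the two sample rows from the question; itertools.product is only a local stand-in for cartesian (an assumption for illustration, not part of the Spark job).

```python
import itertools
import math

def calculate_euclidean(x, y):
    # Squared differences over every key except the 'actor' label
    squared = sum((x[key] - y[key]) ** 2 for key in x if key != 'actor')
    return {'value': math.sqrt(squared),
            'actor1': x['actor'],
            'actor2': y['actor']}

# Sample rows copied from the question
rows = [
    {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0},
    {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1},
]

# itertools.product yields (x, y) tuples, the same shape RDD.cartesian produces
local_pairs = itertools.product(rows, rows)
results = [calculate_euclidean(x, y) for x, y in local_pairs]

for r in results:
    print(r['actor1'], r['actor2'], round(r['value'], 4))
```

If the local results look right, the same function can be passed to map on the cartesian RDD, unpacking each (x, y) pair into its two arguments.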

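All of the distance calculations are then distributed across the cluster, instead of being driven by a Python loop that launches a separate filter and first job for every actor, so it should run much faster.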
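
Since init_actors is a short list, one possible variation (not part of the answer above, and only reasonable if the few centroid rows fit comfortably in memory on every executor) is to collect those rows on the driver, broadcast them, and compute the distances with flatMap, which skips the cartesian step. The helper name distances_to_centroids is hypothetical, and the Spark calls are shown as comments because they assume an existing SparkContext named sc.

```python
import math

def distances_to_centroids(row, centroid_rows):
    """Return one result dict per centroid for a single row (hypothetical helper)."""
    out = []
    for c in centroid_rows:
        # Squared differences over every key except the 'actor' label
        squared = sum((row[k] - c[k]) ** 2 for k in row if k != 'actor')
        out.append({'value': math.sqrt(squared),
                    'actor1': row['actor'],
                    'actor2': c['actor']})
    return out

# Assumed Spark usage (needs a live SparkContext named sc, plus the bj and
# centroids RDDs and the two actor lists from the question):
#   small = centroids.filter(lambda x: x['actor'] in init_actors).collect()
#   small_bc = sc.broadcast(small)
#   distances = (bj.filter(lambda x: x['actor'] in all_actors)
#                  .flatMap(lambda row: distances_to_centroids(row, small_bc.value)))

# Local check with the sample rows from the question
brad = {'actor': 'brad', 'good': 1, 'bad': 0, 'average': 0}
tom = {'actor': 'tom', 'good': 0, 'bad': 1, 'average': 1}
local = distances_to_centroids(brad, [tom, brad])
```

Whether this beats cartesian depends on the data: with 30,000+ keys per dictionary, each broadcast row is sizeable, so it is worth measuring both versions on a realistic sample.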
