Reputation: 3015
I'm trying to do a mapreduce-like operation using Python Spark. Here is what I have and my problem.
object_list = list(objects)  # this is precomputed earlier in my script

def my_map(obj):
    return [f(obj)]

def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2
What I am trying to do is something like the following:
myrdd = rdd(object_list) #objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()
where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for the speed; my script has been taking too long when doing this in a for loop. Does anyone know how to do the above in Spark?
Upvotes: 0
Views: 1057
Reputation: 27503
It would usually look like this:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a, b: a + b)
There is a sum function for RDDs, so this could also be:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()
However, this will give you a single number: f(obj1) + f(obj2) + ...
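For example, with a small toy f (purely hypothetical, just to show the shape of the result):

f = lambda x: x * 2                                  # stand-in for the real f
my_result = sc.parallelize([1, 2, 3]).map(f).sum()
# my_result == 12, a single number rather than [2, 4, 6]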
If you want an array of all the results, [f(obj1), f(obj2), ...], you would not use .reduce() or .sum() but instead .collect():
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
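For completeness, here is an untested, self-contained sketch of the .collect() approach. The SparkContext setup and the stand-in f are assumptions for illustration, not part of the question:

from pyspark import SparkContext

sc = SparkContext("local[*]", "map-collect-example")  # assumed local setup

def f(obj):
    return obj * 2            # stand-in for whatever f does in the real script

object_list = [1, 2, 3, 4]    # precomputed earlier in the real script

myrdd = sc.parallelize(object_list)   # objects are now spread across partitions
my_result = myrdd.map(f).collect()    # [2, 4, 6, 8] == [f(obj1), ..., f(objn)]

Note that the original my_map/my_reduce pair would also rebuild the full list (myrdd.map(my_map).reduce(my_reduce)), since concatenating one-element lists just assembles the list back up, but .collect() is the more direct way to get all results back to the driver.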
Upvotes: 2