Reputation: 3015
I'm trying to do a mapreduce-like operation using Python Spark. Here is what I have and my problem.
object_list = list(objects)  # this is precomputed earlier in my script

def my_map(obj):
    return [f(obj)]

def my_reduce(obj_list1, obj_list2):
    return obj_list1 + obj_list2
What I am trying to do is something like the following:
myrdd = rdd(object_list) #objects are now spread out
myrdd.map(my_map)
myrdd.reduce(my_reduce)
my_result = myrdd.result()
where my_result should now just be [f(obj1), f(obj2), ..., f(objn)]. I want to use Spark purely for the speed; my script has been taking too long when doing this in a for loop. Does anyone know how to do the above in Spark?
Upvotes: 0
Views: 1057
Reputation: 27503
It would usually look like this:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).reduce(lambda a, b: a + b)
There is a sum function for RDDs, so this could also be:
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).sum()
However, this will give you a single number: f(obj1) + f(obj2) + ...
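For example, with a small toy f (purely hypothetical, just to show the shape of the result):

f = lambda x: x * 2                                  # stand-in for the real f
my_result = sc.parallelize([1, 2, 3]).map(f).sum()
# my_result == 12, a single number rather than [2, 4, 6]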
If you want an array of all the results, [f(obj1), f(obj2), ...], you would not use .reduce() or .sum() but instead .collect():
myrdd = sc.parallelize(object_list)
my_result = myrdd.map(f).collect()
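For completeness, here is an untested, self-contained sketch of the .collect() approach. The SparkContext setup and the stand-in f are assumptions for illustration, not part of the question:

from pyspark import SparkContext

sc = SparkContext("local[*]", "map-collect-example")  # assumed local setup

def f(obj):
    return obj * 2            # stand-in for whatever f does in the real script

object_list = [1, 2, 3, 4]    # precomputed earlier in the real script

myrdd = sc.parallelize(object_list)   # objects are now spread across partitions
my_result = myrdd.map(f).collect()    # [2, 4, 6, 8] == [f(obj1), ..., f(objn)]

Note that the original my_map/my_reduce pair would also rebuild the full list (myrdd.map(my_map).reduce(my_reduce)), since concatenating one-element lists just assembles the list back up, but .collect() is the more direct way to get all results back to the driver.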
Upvotes: 2