Reputation: 2992
I noticed some strange behavior with PySpark and would appreciate any insights.
Suppose I have an RDD composed of simple elements:
from collections import namedtuple
Animal = namedtuple('Animal', ('name','age'))
a = Animal('jeff',3)
b = Animal('mike',5)
c = Animal('cathy',5)
rdd=sc.parallelize([a,b,c])
Now I'm interested in capturing, in a simple class, the different attributes of that RDD, using for example rdd.map(lambda s: getattr(s, 'name')) to extract the name attribute from each element.
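That direct call behaves as expected on its own (this assumes the rdd above and a live SparkContext sc):
rdd.map(lambda s: getattr(s, 'name')).collect()
# expected: ['jeff', 'mike', 'cathy']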
So objects of this class
class simple():
    def __init__(self, name):
        self.name = name
    def get_value(self):
        self.value = rdd.map(lambda s: getattr(s, self.name)).collect()
will set their name and fetch the corresponding values from the RDD.
theAges = simple('age')
theAges.get_value()
However, this encounters an error that I think centers on the self.name in the lambda expression. This second class works fine:
class simple2():
    def __init__(self, name):
        self.name = name
    def get_value(self):
        n = self.name
        self.value = rdd.map(lambda s: getattr(s, n)).collect()
where all I have added is a preceding assignment n = self.name and passed n into the lambda instead of self.name.
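For completeness, that second version does run against the rdd above, for example (theAges2 is just an illustrative name):
theAges2 = simple2('age')
theAges2.get_value()
theAges2.value   # expected: [3, 5, 5]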
So is the problem that we are unable to evaluate self.name within the lambda? I've created a similar situation (with self.name in a lambda) in pure Python and there are no errors, so I think this is Spark specific. Thanks for your thoughts.
Upvotes: 1
Views: 1315
Reputation: 154
This is due to PySpark being unable to serialize a closure over the class instance: referencing self.name inside the lambda pulls the whole object into the closure, and pickling it for shipment to the executors fails. Assigning n in the get_value scope allows Spark to ship off the pickled function with what amounts to a local alias for the object attribute, so only that value travels with the closure. So far, it seems the solution is to just copy class attributes into local variables in the function scope (but don't count on them changing afterwards!)
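A minimal sketch of that pattern (the class name Simple3 and the rdd parameter are just illustrative; it assumes the Animal RDD from the question and a live SparkContext):
class Simple3(object):
    def __init__(self, name):
        self.name = name
    def get_value(self, rdd):
        # Copy the attribute into a local variable first; the lambda then
        # closes over a plain string instead of over self, so only that
        # string is pickled and shipped to the executors.
        field = self.name
        return rdd.map(lambda s: getattr(s, field)).collect()

the_ages = Simple3('age')
the_ages.get_value(rdd)   # expected: [3, 5, 5]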
Upvotes: 1