keegan

Reputation: 2992

pyspark getattr() behavior

I've noticed some strange behavior with PySpark and would appreciate any insights.

Suppose I have an RDD composed of simple elements

from collections import namedtuple
Animal = namedtuple('Animal', ('name', 'age'))
a = Animal('jeff', 3)
b = Animal('mike', 5)
c = Animal('cathy', 5)
rdd = sc.parallelize([a, b, c])  # sc is an existing SparkContext

Now I'm interested in capturing, in a simple class, the different attributes of that RDD, using for example rdd.map(lambda s: getattr(s,'name')) to extract the name attribute from each element.
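Calling that directly works as expected; for illustration:

rdd.map(lambda s: getattr(s, 'name')).collect()
# ['jeff', 'mike', 'cathy']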

So objects of this class

class simple():
    def __init__(self, name):
        self.name = name
    def get_value(self):
        # referencing self.name inside the lambda closes over self
        self.value = rdd.map(lambda s: getattr(s, self.name)).collect()

will set their name and fetch the corresponding values from the RDD.

theAges = simple('age')
theAges.get_value()

However, this encounters an error that I think centers on the self.name reference in the lambda expression. This second class works fine

class simple2():
    def __init__(self, name):
        self.name = name
    def get_value(self):
        # bind the attribute to a local variable before the lambda uses it
        n = self.name
        self.value = rdd.map(lambda s: getattr(s, n)).collect()

where all I have added is a preceding assignment n = self.name, passing n into the lambda instead of self.name.

So is the problem that we are unable to evaluate self.name within the lambda? I've created a similar situation (with self.name in a lambda) in pure Python, along the lines of the sketch below, and there are no errors, so I think this is Spark-specific. Thanks for your thoughts.
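For reference, the pure-Python analogue I had in mind looks something like this (the class name simple_pure is just a placeholder), and it runs without complaint:

class simple_pure():
    def __init__(self, name):
        self.name = name
    def get_value(self, data):
        # plain Python evaluates self.name inside the lambda without issue
        self.value = list(map(lambda s: getattr(s, self.name), data))

theAges = simple_pure('age')
theAges.get_value([a, b, c])   # theAges.value == [3, 5, 5]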

Upvotes: 1

Views: 1315

Answers (1)

Vatsu1

Reputation: 154

This is due to PySpark being unable to create a closure over the class instance: referencing self.name inside the lambda captures self, so the whole instance would have to be pickled and shipped along with the function. Assigning n in the get_value scope lets Spark ship the pickled function with what amounts to a local snapshot of the attribute rather than the object itself. So far, it seems the solution is to copy any class attributes you need into local variables in the function scope (but don't count on them changing!).
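If you want to keep self out of the closure altogether, one alternative sketch (the class name simple3 is just for illustration; it assumes the same rdd as in the question and Python 3.5+, where operator.attrgetter objects are picklable) is to build the accessor on the driver first:

import operator

class simple3():
    def __init__(self, name):
        self.name = name
    def get_value(self):
        # attrgetter(self.name) is evaluated here on the driver, so only
        # this small picklable callable is shipped to the executors, not self
        getter = operator.attrgetter(self.name)
        self.value = rdd.map(getter).collect()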

Upvotes: 1
