Reputation: 9875
I have a list of lists like this:
b = [['r','w'],['n','finished']]
I would like to be able to operate on each element within each list.
I can do this locally in Python:
result = b.map(lambda aList:
    map(lambda aString:
            '' if aString.strip().lower() in ['finish', 'finished', 'terminate', 'done'] else aString,
        aList))
But Spark has trouble serializing the inner map:
File "/<path>/python/pyspark/worker.py", line 88, in main
12/11/2015 18:24:49 [launcher] command = pickleSer._read_with_length(infile)
12/11/2015 18:24:49 [launcher] File "//<path>/spark/python/pyspark/serializers.py", line 156, in _read_with_length
12/11/2015 18:24:49 [launcher] return self.loads(obj)
12/11/2015 18:24:49 [launcher] File "//<path>//python/pyspark/serializers.py", line 405, in loads
12/11/2015 18:24:49 [launcher] return cPickle.loads(obj)
12/11/2015 18:24:49 [launcher] AttributeError: 'module' object has no attribute 'map'
How do I work around this, either by using an inner map or by accomplishing the same thing some other way?
Upvotes: 0
Views: 7306
Reputation: 18042
Alternatively, using @zero323's template: if you are using Python 2.x you can use map instead of a list comprehension, but this is a Python issue, not a PySpark one, and the effect is the same.
to_replace = ['finish', 'finished', 'terminate', 'done']
rdd = sc.parallelize([['r','w'],['n','finished']])
rdd.map(
    lambda xs: map(lambda x: "" if x.strip().lower() in to_replace else x, xs)
)
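Note that on Python 3 map returns a lazy iterator rather than a list, so a minimal adaptation is to materialize it with list():
rdd.map(
    lambda xs: list(map(lambda x: "" if x.strip().lower() in to_replace else x, xs))
)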
But if the to_replace list is really big, you should use a broadcast variable.
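A minimal sketch of the broadcast variant (using a set for constant-time membership tests; sc is the SparkContext from above):
# Ship the lookup table to each executor once instead of with every task
to_replace_bc = sc.broadcast(set(['finish', 'finished', 'terminate', 'done']))
rdd.map(
    lambda xs: ["" if x.strip().lower() in to_replace_bc.value else x for x in xs]
)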
Upvotes: 2
Reputation: 330353
One way to handle this:
to_replace = ['finish', 'finished', 'terminate', 'done']
rdd = sc.parallelize([['r','w'],['n','finished']])
rdd.map(lambda xs: ['' if x.strip().lower() in to_replace else x for x in xs])
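For example, collecting the result to verify (assuming the rdd defined above):
result = rdd.map(lambda xs: ['' if x.strip().lower() in to_replace else x for x in xs])
result.collect()  # [['r', 'w'], ['n', '']]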
Generally speaking, if you find yourself thinking about nested functions, it is a good sign that you should use a normal function instead of a lambda expression.
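A sketch of that refactoring, with the replacement logic pulled into a named top-level function (clean_row is just an illustrative name):
def clean_row(xs):
    # Blank out any token that normalizes to a "finished"-style word
    return ['' if x.strip().lower() in to_replace else x for x in xs]

rdd.map(clean_row)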
Upvotes: 3