Reputation: 191
How can I deal with the fact that I can't debug my code with print statements inside the function that I pass to mapPartitions() in PySpark?
Consider the example:
    def func(kv_iterator):
        for key, value in kv_iterator:
            # do fancy stuff
            print('This print statement does not reach the driver program')
        return [result]

    result = someRdd.mapPartitions(func)
Inside func I'd like to do lots of work with iterables and indexing, but I can't test my code without having access to the variables inside func.
Is it possible to somehow redirect the print statements from, let's say, one partition to my driver program / output channel?
Upvotes: 7
Views: 2241
Reputation: 35229
You can use one of the following:

- local mode. All output should be visible in the console. If it is not, your code is probably never executed - try result.count(), result.foreach(lambda _: None), or another action - this is probably the problem here.
- Redirect stdout (and stderr if you want) to a file. For basic prints use the file argument:

      print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)

  (a sketch combining the first two options follows this list)
- Use a remote debugger - How can pyspark be called in debug mode?
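Here is a minimal sketch combining the first two options; local[*], the sample data, the log path, and the body of func are illustrative assumptions, not part of the original question:

    # Illustrative only: run in local mode and trigger an action so that
    # the partition function actually executes.
    from pyspark import SparkContext

    sc = SparkContext(master='local[*]', appName='print-debug')

    def func(kv_iterator):
        # One log file per partition invocation; in local mode a bare
        # print() would also show up in the driver console.
        with open('/tmp/partition_debug.log', 'a') as log:
            for key, value in kv_iterator:
                print('saw', key, value, file=log)
                yield key, value

    rdd = sc.parallelize([('a', 1), ('b', 2), ('c', 3)], 2)
    # mapPartitions is lazy - without an action, func never runs at all.
    rdd.mapPartitions(func).foreach(lambda _: None)

foreach(lambda _: None) is just the cheapest action here; result.count() would force execution equally well.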
But what is most important - test the function outside Spark. A function used with mapPartitions should accept an Iterable (the concrete implementation is usually itertools.chain) and return an Iterable.
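For example, a minimal standalone test might look like this (the doubling logic is a made-up stand-in for the real per-record work):

    # No Spark needed: feed func an itertools.chain, the same iterator
    # type that mapPartitions passes in at runtime.
    from itertools import chain

    def func(kv_iterator):
        for key, value in kv_iterator:
            yield key, value * 2  # stand-in for the "fancy stuff"

    fake_partition = chain([('a', 1), ('b', 2)], [('c', 3)])
    print(list(func(fake_partition)))
    # [('a', 2), ('b', 4), ('c', 6)]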
Upvotes: 5