JohnnyS

Reputation: 191

How to debug the function passed to mapPartitions

How can I deal with the problem that I can't debug my code with print statements inside the function which I pass to mapPartitions() in pyspark?

Consider the example:

def func(kv_iterator):
    result = []
    for key, value in kv_iterator:
        #do fancy stuff
        print('This print statement does not reach the driver program')
        result.append((key, value))
    return result

result = someRdd.mapPartitions(func)

Inside func I'd like to do lots of work with iterables and indexing, but I can't test my code without having access to the variables inside func.

Is it possible to somehow redirect the print statements from, let's say, one partition to my driver program / output channel?

Upvotes: 7

Views: 2241

Answers (1)

Alper t. Turker

Reputation: 35229

You can use one of the following:

  • Use local mode. All output should be visible in the console. If it is not, your code is probably never executed - try result.count(), result.foreach(lambda _: None), or another action; this is probably the problem here.
  • Redirect stdout (and stderr if you want) to a file. For basic prints use the file argument:

    print(*objects, sep=' ', end='\n', file=sys.stdout, flush=False)
    
  • Use a remote debugger - see How can pyspark be called in debug mode?
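For the second option, here is a minimal sketch of a partition function that redirects its debug prints to a file on the worker instead of stdout. The log path and the body of the function are illustrative assumptions, not part of the original answer:

```python
import os

def func(kv_iterator):
    # Hypothetical debug pattern: write to a per-process log file on the
    # worker, since stdout of executors is not shown on the driver.
    log_path = '/tmp/partition_debug_{}.log'.format(os.getpid())
    result = []
    with open(log_path, 'a') as log_file:
        for key, value in kv_iterator:
            print('processing', key, value, file=log_file)
            result.append((key, value))
    return result
```

On a cluster each executor writes to its own local filesystem, so you would inspect the files on the worker nodes (or point log_path at a shared location).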

But what is most important - test the function outside Spark. A function used with mapPartitions should accept an Iterable (the concrete implementation is usually itertools.chain) and return an Iterable.
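For example, you can exercise the function locally by feeding it an itertools.chain, mirroring what Spark passes to mapPartitions (the toy func below is a stand-in for your own partition function, not code from the question):

```python
from itertools import chain

def func(kv_iterator):
    # Toy partition function: sum all values in the partition,
    # yielding a single result per partition.
    total = 0
    for key, value in kv_iterator:
        total += value
    yield total

# Simulate one partition exactly as mapPartitions would deliver it
partition = chain([('a', 1), ('b', 2)], [('c', 3)])
print(list(func(partition)))  # prints [6]
```

Once the function behaves correctly on a plain iterable, any remaining problems are in the Spark plumbing rather than in your logic.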

Upvotes: 5
