How to run memory intensive shell script from PySpark rdd.mapPartitions

Question

Let's say I have a Spark cluster with 32gb of RAM nodes. 1G of executor memory is enough for processing any of my data.

There is a Linux shell program (program) that I need to run for each partition. It would sound easy if it's simple Linux pipe script, but the program requires 10GB of memory for each run. My initial assumption was that I can just increase executor memory to 11GB and Spark will use one executor per partition for 1G and the other 10G will be allocated for the program that will run in context of executor. But it's not. It's using 11GB for 1G of Spark data and after that it runs the 10GB program in available node memory.

So, I've changed the executor memory back to 1GB and decided to play with cores, instances and yarn. I've tried to use: --executor_memory 1G --driver_memory 1G --executor_cores 1 --num_executors 1

and for YARN 32GB - (10GB * 2 running programs per node) = 12G - 4G for OS = 8G * 1024M = : "yarn.nodemanager.resource.memory-mb": "8192", "yarn.scheduler.maximum-allocation-mb": "8192"

Because I'm using 1G per executor, Spark starts 8192 / (1024 * 1.18 of overhead) ~ 6 executors per node. Definitely if each executor will start 10GB program, there will be no RAM to do this. I've increased executor memory to reduce number of executors per node to 2 with executor memory = 3GB

Now it runs 2 executors per node, but the program still fails with Out of Memory exception.

I've added a code to check available memory right before starting the program

total_memory, used_memory, free_memory, shared_memory, cache, available_memory = map(
int, os.popen('free -t -m | grep Mem:').readlines()[0].split()[1:])

But even if available_memory is > 10G the program starts, but it's running out of memory in a middle (it runs for about 4 mins).

Is there a way to allocate memory for external script on a executor nodes? Or maybe there is a workaround for this?

I would appreciate any HELP!!! Thanks in advance, Orka

How to run memory intensive shell script from PySpark rdd.mapPartitions

Answers (1)

Related Questions