Reputation: 1079
I'm using an r4.8xlarge on AWS Batch to run Spark. This is already a big machine: 32 vCPUs and 244 GB of memory. On AWS Batch the process runs inside a Docker container. From multiple sources I saw that we should run java with the parameters:
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=1
Even with these parameters the process never went over 31 GB of resident memory and 45 GB of virtual memory.
Here are the analyses I did. First test:
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=1 -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 26.67G
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Second test:
docker run -it --rm 650967531325.dkr.ecr.eu-west-1.amazonaws.com/java8_aws java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2 -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 26.67G
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
Third test:
java -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=10 -XshowSettings:vm -version
VM settings:
Max. Heap Size (Estimated): 11.38G
Ergonomics Machine Class: server
Using VM: OpenJDK 64-Bit Server VM
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-1~deb9u1-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
The system is built with Native Packager as a standalone application. A SparkSession is built as follows, with Cores equal to 31 (32 - 1):
SparkSession
  .builder()
  .appName(applicationName)
  .master(s"local[$Cores]")
  .config("spark.executor.memory", "3g")
Answer to egorlitvinenko:
$ docker stats
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
0c971993f830 ecs-marcos-BatchIntegration-DedupOrder-3-default-aab7fa93f0a6f1c86800 1946.34% 27.72GiB / 234.4GiB 11.83% 0B / 0B 72.9MB / 160kB 0
a5d6bf5522f6 ecs-agent 0.19% 19.56MiB / 240.1GiB 0.01% 0B / 0B 25.7MB / 930kB 0
More tests, now with the Oracle JDK; the memory never went over 4 GB:
$ 'spark-submit' '--class' 'integration.deduplication.DeduplicationApp' '--master' 'local[31]' '--executor-memory' '3G' '--driver-memory' '3G' '--conf' '-Xmx=150g' '/localName.jar' '--inPath' 's3a://dp-import-marcos-refined/platform-services/order/merged/*/*/*/*' '--outPath' 's3a://dp-import-marcos-refined/platform-services/order/deduplicated' '--jobName' 'DedupOrder' '--skuMappingPath' 's3a://dp-marcos-dwh/redshift/item_code_mapping'
I used the parameters -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:MaxRAMFraction=2
with my Spark application, and it is clearly not using all the memory. How can I address this issue?
Upvotes: 1
Views: 2253
Reputation: 74779
tl;dr Use --driver-memory and --executor-memory when you spark-submit your Spark application, or set the proper memory settings for the JVM that hosts the Spark application.
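For example (the memory sizes here are illustrative; the class and jar are the ones from the question), something along these lines gives the single local-mode JVM a much larger heap. With a local master the driver and the executor share one JVM, so --driver-memory is the setting that actually grows it:
spark-submit \
  --class integration.deduplication.DeduplicationApp \
  --master 'local[31]' \
  --driver-memory 200g \
  --executor-memory 3g \
  /localName.jar <remaining application arguments as in the question>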
The memory for the driver is 1024M by default, which you can check using spark-submit:
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
The memory for the executor is 1G by default, which you can check, again using spark-submit:
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
With that said, it does not really matter how much memory your execution environment has in total, as a Spark application won't use more than the default 1G for the driver and executors.
Since you use a local master URL, the memory settings of the driver's JVM are already fixed by the time your Spark application executes. It is simply too late to set them while creating a SparkSession. The single JVM of the Spark application (with the driver and a single executor all running in the same JVM) has already started, so no config can change it.
In other words, how much memory the Docker container has in total has no impact on how much memory the Spark application uses. The two are configured independently. Of course, the more memory the Docker container has, the more memory a process inside it can possibly use (so in that sense they are interconnected).
Use --driver-memory and --executor-memory when you spark-submit your Spark application, or set the proper memory settings for the JVM that hosts the Spark application.
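Since the question mentions the application is also started as a standalone app built with Native Packager (i.e. not through spark-submit), the heap of the hosting JVM has to be set on its start script in that case. A minimal sketch, assuming the default bash start script generated by sbt-native-packager (the script name below is hypothetical, and the size should match your container limit):
# -J forwards the option straight to the launched JVM
./bin/deduplication-app -J-Xmx200g --inPath <in> --outPath <out>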
Upvotes: 1