abhishek

Reputation: 31

AWS Glue job throwing java.lang.OutOfMemoryError: Java heap space

I am running a Glue ETL transformation job. The job is supposed to read data from S3 and convert it to Parquet.

Below is the Glue source; sourcePath is the S3 location of the files.

In this location we have around 100 million JSON files, all of them nested in sub-folders.

That is why I am applying an exclusion pattern: it excludes every file that does not start with a, so I believe only the files starting with a (around 2.7 million files) will be processed.

val file_paths = Array(sourcePath)

val exclusionPattern = "\"" + sourcePath + "{[!a]}**" + "\""

glueContext
  .getSourceWithFormat(connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> file_paths, "recurse" -> true, "groupFiles" -> "inPartition", "exclusions" -> s"[$exclusionPattern]"
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()
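
For clarity, this is what the exclusions value ends up looking like with the string construction above (the exclusions val here is just an illustrative intermediate; I am assuming sourcePath is the s3://xyz/ prefix shown in the logs below):

// Illustrative only: the "exclusions" option above ends up receiving a
// JSON-style list with a single quoted glob. For sourcePath = "s3://xyz/"
// that is ["s3://xyz/{[!a]}**"], i.e. exclude every object under the prefix
// whose key does not start with "a".
val exclusions = s"[$exclusionPattern]"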

I have run this job with the Standard worker type and with the G.2X worker type as well, and I keep getting the error:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27788"...

In CloudWatch I can see that driver memory utilization reaches 100%, but executor memory usage is almost nil.

When running the job I am setting spark.driver.memory=10g and spark.driver.memoryOverhead=4096 via the --conf job parameter.

These are the details from the logs:

--conf spark.hadoop.yarn.resourcemanager.connect.max-wait.ms=60000 
--conf spark.hadoop.fs.defaultFS=hdfs://ip-myip.compute.internal:1111 
--conf spark.hadoop.yarn.resourcemanager.address=ip-myip.compute.internal:1111 
--conf spark.dynamicAllocation.enabled=true 
--conf spark.shuffle.service.enabled=true 
--conf spark.dynamicAllocation.minExecutors=1 
--conf spark.dynamicAllocation.maxExecutors=4 
--conf spark.executor.memory=20g 
--conf spark.executor.cores=16 
--conf spark.driver.memory=20g 
--conf spark.default.parallelism=80 
--conf spark.sql.shuffle.partitions=80 
--conf spark.network.timeout=600 
--job-bookmark-option job-bookmark-disable 
--TempDir s3://my-location/admin 
--class com.example.ETLJob 
--enable-spark-ui true 
--enable-metrics 
--JOB_ID j_111... 
--spark-event-logs-path s3://spark-ui 
--conf spark.driver.memory=20g 
--JOB_RUN_ID jr_111... 
--conf spark.driver.memoryOverhead=4096 
--scriptLocation s3://my-location/admin/Job/ETL 
--SOURCE_DATA_LOCATION s3://xyz/ 
--job-language scala 
--DESTINATION_DATA_LOCATION s3://xyz123/ 
--JOB_NAME ETL

Any ideas what could be the issue?

Thanks

Upvotes: 1

Views: 4807

Answers (2)

abhishek

Reputation: 31

As suggested by @Eman...

I applied all three options, groupFiles, groupSize and useS3ListImplementation, like this:

options = JsonOptions(Map(
  "path" -> sourcePath,
  "recurse" -> true,
  "groupFiles" -> "inPartition",
  "groupSize" -> 104857600, // 100 MB
  "useS3ListImplementation" -> true
))

And that is working for me. There is also an "acrossPartitions" option if the data is not arranged properly.
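
For context, here is a minimal sketch of how those options slot into the getSourceWithFormat call from the question (reusing sourcePath, transformRow and the transformation context from there; treat it as illustrative rather than a drop-in replacement):

glueContext
  .getSourceWithFormat(connectionType = "s3",
    options = JsonOptions(Map(
      "paths" -> Array(sourcePath),      // same source prefix as in the question
      "recurse" -> true,
      "groupFiles" -> "inPartition",     // coalesce small files within a partition
      "groupSize" -> 104857600,          // target roughly 100 MB per group
      "useS3ListImplementation" -> true  // paginate the S3 listing instead of buffering all keys on the driver
    )),
    format = "json",
    transformationContext = "sourceDF"
  )
  .getDynamicFrame()
  .map(transformRow, "error in row")
  .toDF()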

Upvotes: 0

Eman

Reputation: 851

If you have too many files, you are probably overwhelming the driver. Try using useS3ListImplementation. It is an implementation of the Amazon S3 ListKeys operation, which splits large result sets into multiple responses.

Try adding:

"useS3ListImplementation" -> true

[1] https://aws.amazon.com/premiumsupport/knowledge-center/glue-oom-java-heap-space-error/

Upvotes: 2
