Reputation: 794
Amazon EMR, Apache Spark 2.3, Apache Kafka, ~10 million records per day.
Apache Spark is used to process events in 5-minute batches. Roughly once per day the worker nodes die and AWS automatically reprovisions them. Reviewing the log messages, it looks like the nodes run out of disk space, even though they have about 1 TB of storage each.
Has anyone had storage space issues in a setup where there should be more than enough?
I was thinking that log aggregation might not be copying the logs to the S3 bucket properly, which as far as I can see should be done automatically by the Spark process.
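So far I have only looked at total disk usage; below is a rough sketch of the commands I planned to run on a core node to see which directories are actually growing (the specific paths are my guesses at the EMR defaults, not something I have confirmed):

    # Overall usage per mounted volume
    df -h

    # Size of the usual suspects for log/shuffle growth on an EMR node
    # (paths are assumptions about the EMR defaults)
    sudo du -sh /var/log/spark /var/log/hadoop-yarn /mnt/yarn /mnt/var/log 2>/dev/null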
What kind of information should I provide to help resolve this issue?
Thank you in advance!
Upvotes: 8
Views: 1989
Reputation: 794
I believe I fixed the issue by using a custom log4j.properties: on deployment to Amazon EMR I replace /etc/spark/log4j.properties and then run spark-submit with my streaming application.
Now it's working well.
https://gist.github.com/oivoodoo/d34b245d02e98592eff6a83cfbc401e3
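The gist has the full file; the core idea is to replace Spark's default appender with a rolling one so the container logs cannot grow without bound. A minimal sketch of such a log4j.properties (the appender name, sizes and log path below are illustrative, not copied from the gist):

    log4j.rootLogger=INFO, rolling

    # Rolling file appender: cap each log at 50 MB and keep at most 5 backups
    log4j.appender.rolling=org.apache.log4j.RollingFileAppender
    log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
    log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
    log4j.appender.rolling.MaxFileSize=50MB
    log4j.appender.rolling.MaxBackupIndex=5
    log4j.appender.rolling.File=${spark.yarn.app.container.log.dir}/spark.log

    # Quiet the chattiest packages
    log4j.logger.org.apache.spark=WARN
    log4j.logger.org.apache.kafka=WARN

On EMR this file would replace /etc/spark/log4j.properties on the nodes (for example via a bootstrap action) before spark-submit is run, as described above.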
These may also be helpful for anyone who runs a streaming application and needs to roll out updates with a graceful stop:
https://gist.github.com/oivoodoo/4c1ef67544b2c5023c249f21813392af
https://gist.github.com/oivoodoo/cb7147a314077e37543fdf3020730814
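For context, the building blocks those gists rely on (a rough skeleton, not their exact code) are Spark's graceful-shutdown flag and StreamingContext.stop with stopGracefully = true, so in-flight batches finish before the executors exit:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GracefulStopExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("streaming-app")
          // Ask Spark to stop the streaming context gracefully on SIGTERM,
          // letting already-queued batches complete first.
          .set("spark.streaming.stopGracefullyOnShutdown", "true")

        val ssc = new StreamingContext(conf, Seconds(300)) // 5-minute batches

        // ... create the Kafka DStream and the processing pipeline here ...

        ssc.start()

        // Alternative to the config flag: watch an external marker (a file on
        // S3/HDFS, a ZooKeeper node, ...) and stop explicitly when it appears:
        // ssc.stop(stopSparkContext = true, stopGracefully = true)

        ssc.awaitTermination()
      }
    }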
Upvotes: 0
Reputation: 2462
I had a similar issue with a Structured Streaming app on EMR: disk space increased rapidly to the point of stalling/crashing the application.
In my case the fix was to disable the Spark event log by setting spark.eventLog.enabled to false.
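On EMR that can be set per application on the submit command, or cluster-wide through a configuration classification (a sketch; adjust the rest of the command to your own setup):

    spark-submit --conf spark.eventLog.enabled=false ...

    # or, cluster-wide, in the EMR configuration JSON:
    # [{"Classification": "spark-defaults",
    #   "Properties": {"spark.eventLog.enabled": "false"}}]

Keep in mind that with the event log disabled, the application will no longer appear in the Spark History Server.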
Upvotes: 2