Using Apache Spark 1.6.4 with the elasticsearch-hadoop plugin, I am exporting an Elasticsearch index (100 million documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7.
I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB RAM).
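For context, the job looks roughly like the sketch below: read the index through the elasticsearch-hadoop Spark SQL integration and write it out as gzip-compressed Parquet. The host name, index/type name, and HDFS output path are placeholders, not values from my actual setup.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class EsToParquetExport {
    public static void main(String[] args) {
        // Placeholder executor sizing and ES node address -- adjust to your cluster.
        SparkConf conf = new SparkConf()
                .setAppName("es-to-parquet")
                .set("spark.executor.cores", "8")
                .set("spark.executor.memory", "12g")
                .set("es.nodes", "es-host:9200");

        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // elasticsearch-hadoop creates one Spark partition (task) per ES shard,
        // so a 5-shard index is read as 5 tasks.
        DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "my_index/my_type");

        // Use gzip compression for Parquet, then write to HDFS.
        sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip");
        df.write().parquet("hdfs:///exports/my_index.parquet");

        sc.stop();
    }
}
```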
The job consists of 5 tasks (one per ES shard) and takes about 1 hour. It works fine most of the time, but sometimes I see Spark tasks fail with an out-of-memory error.
During the process, I can see some temporary files in HDFS, but they are always 0 bytes in size.
Q: I am wondering whether Spark buffers the data in memory before writing the gz.parquet file?