Using Apache Spark 1.6.4 with the elasticsearch-hadoop plugin, I am exporting an Elasticsearch index (100 million documents, 100 GB, 5 shards) into a gzipped Parquet file on HDFS 2.7.
I run this ETL as a Java program with 1 executor (8 CPUs, 12 GB RAM).
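For context, the job looks roughly like the sketch below: read the index through the elasticsearch-hadoop Spark SQL integration and write it out as gzip-compressed Parquet. The host name, index/type name, and HDFS output path are placeholders, not values from my actual setup.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

public class EsToParquetExport {
    public static void main(String[] args) {
        // Placeholder executor sizing and ES node address -- adjust to your cluster.
        SparkConf conf = new SparkConf()
                .setAppName("es-to-parquet")
                .set("spark.executor.cores", "8")
                .set("spark.executor.memory", "12g")
                .set("es.nodes", "es-host:9200");

        JavaSparkContext sc = new JavaSparkContext(conf);
        SQLContext sqlContext = new SQLContext(sc);

        // elasticsearch-hadoop creates one Spark partition (task) per ES shard,
        // so a 5-shard index is read as 5 tasks.
        DataFrame df = JavaEsSparkSQL.esDF(sqlContext, "my_index/my_type");

        // Use gzip compression for Parquet, then write to HDFS.
        sqlContext.setConf("spark.sql.parquet.compression.codec", "gzip");
        df.write().parquet("hdfs:///exports/my_index.parquet");

        sc.stop();
    }
}
```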
The job consists of 5 tasks (one per ES shard) and takes about 1 hour. It works fine most of the time, but sometimes I see Spark tasks fail with an out-of-memory error.
During the process, I can see some temporary files in HDFS, but they are always 0 bytes in size.
Q: I am wondering whether Spark buffers the data in memory before writing the gz.parquet file?