ernitingoel

Reputation: 661

Writing multiple parquet files in parallel

I have a large Spark Dataset (Java) and I need to apply several filters to get multiple datasets, then write each resulting dataset to a parquet file.

Does Spark's Java API provide any feature to write all the parquet files in parallel? I am trying to avoid doing it sequentially.

The other option is to use Java threads; is there any other way to do it?
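For reference, here is roughly the sequential version I am trying to avoid (the column name col1 and the filter values are just placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class SequentialParquetWrite {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("SequentialParquetWrite").getOrCreate();

        Dataset<Row> ds = spark.read().json("input/path");

        // Placeholder filter values; in reality there are many of them
        List<String> filterValues = Arrays.asList("A", "B", "C");

        // Each iteration submits its own Spark job, one after another
        for (String value : filterValues) {
            ds.filter(ds.col("col1").equalTo(value))
              .write()
              .parquet("/output/path/" + value);
        }
    }
}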

Upvotes: 1

Views: 2031

Answers (2)

QuickSilver

Reputation: 4045

Yes, by default Spark provides parallelism using its executors, but if you also want to achieve parallelism on the driver, you can do something like this:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.ArrayList;
import java.util.List;

public class ParallelSparkWrite {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ParallelSparkWrite").getOrCreate();

        Dataset<Row> ds = spark.read().json("input/path");

        // Populate with the filter values you need (e.g. the distinct values of col1)
        List<String> filterValue = new ArrayList<>();

        // Process the values on a parallel stream so that several write jobs
        // are submitted from the driver at the same time
        filterValue.parallelStream()
                .forEach(filter ->
                        // Filter the Dataset and write each result as parquet
                        ds.filter(ds.col("col1").equalTo(filter))
                          .write()
                          .parquet("/output/path/" + filter));
    }
}

Upvotes: 1

Piyush Patel

Reputation: 1751

Spark already writes parquet files in parallel. The degree of parallelism depends on how many executor cores you provide as well as the number of partitions of the DataFrame. You can try df.write.parquet("/location/to/hdfs") and check how long the write takes.
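A minimal sketch along those lines (the input path, output path and partition count below are assumptions; Spark writes one part file per partition, with the write tasks running in parallel across the available executor cores):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetWriteTiming {

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ParquetWriteTiming").getOrCreate();

        Dataset<Row> df = spark.read().json("input/path");

        long start = System.currentTimeMillis();

        // One part file is written per partition; the tasks run in parallel
        // across the executor cores available to the job.
        df.repartition(8)            // assumed partition count, for illustration only
          .write()
          .parquet("/location/to/hdfs");

        System.out.println("Write took " + (System.currentTimeMillis() - start) + " ms");
    }
}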

Upvotes: 1
