Reputation: 661
I have a big Spark Dataset (Java) and I need to apply a filter to get multiple datasets, then write each dataset to Parquet.
Does Spark in Java provide any feature to write all the Parquet files in parallel? I am trying to avoid doing it sequentially.
The other option is to use Java threads; is there any other way to do it?
Upvotes: 1
Views: 2031
Reputation: 4045
Yes, by default Spark provides parallelism through its executors, but if you also want parallelism on the driver side (submitting the filtered writes concurrently), you can do something like this:
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import java.util.ArrayList;
import java.util.List;
public class ParallelSparkWrite {

    public static void main(String[] args) {
        SparkSession spark = Constant.getSparkSess();

        Dataset<Row> ds = spark.read().json("input/path");

        // Populate this list with the values you want to filter on
        List<String> filterValue = new ArrayList<>();

        // Create a parallel stream so several writes are submitted at once
        filterValue.parallelStream()
                .forEach(filter -> {
                    // Filter the Dataset and write each slice in parallel
                    ds.filter(ds.col("col1").equalTo(filter))
                      .write()
                      .json("/output/path/" + filter + ".json");
                });
    }
}
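If each output simply corresponds to one value of col1, an alternative worth considering is a single partitioned write, which avoids the driver-side loop and lets the executors parallelize everything. A minimal sketch, assuming the same col1 column and a hypothetical /output/path base directory:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedParquetWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PartitionedParquetWrite")
                .getOrCreate();

        Dataset<Row> ds = spark.read().json("input/path");

        // One sub-directory per distinct value of col1, e.g. /output/path/col1=foo/
        // All partitions are written by the executors in parallel.
        ds.write()
          .partitionBy("col1")
          .parquet("/output/path");
    }
}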
Upvotes: 1
Reputation: 1751
Spark already writes Parquet files in parallel; the degree of parallelism depends on how many executor cores you provide and on the number of partitions of the DataFrame. You can try df.write().parquet("/location/to/hdfs")
and measure how long the write takes.
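To illustrate, a minimal Java sketch, assuming a Dataset&lt;Row&gt; named df is already loaded; the partition count of 8 and the output path are placeholders:

// Repartitioning controls how many Parquet part files are produced;
// each partition becomes a write task, and tasks run in parallel
// across the available executor cores.
Dataset<Row> repartitioned = df.repartition(8);
repartitioned.write()
        .mode("overwrite")
        .parquet("/location/to/hdfs");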
Upvotes: 1