Reputation: 308
Is it possible to read a Parquet file in parallel?
I'm using something similar to what is described in "how to read a parquet file, in a standalone java code?" (based on AvroParquetReader), but that approach reads the file sequentially, not in parallel.
Cheers!
Upvotes: 5
Views: 1016
Reputation: 21
I'm still new to Parquet files, but I found it faster to open the file as a Spark Dataset, collect it into a list, and process the rows with a parallel stream:
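import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
// Spark Dataset -> List -> parallel forEach
String PATH_SCHEMA = "s3a://" + bucket + "/" + key;
Path path = new Path(PATH_SCHEMA);
// local[1] runs Spark on a single thread; local[*] would use all available cores
SparkSession spark = SparkSession.builder().master("local[1]").appName("example.com").getOrCreate();
Dataset<Row> ds = spark.read().parquet(path.toString());
// collectAsList() pulls every row into driver memory before the parallel stream starts
ds.collectAsList().parallelStream().forEach(Class::method);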
Upvotes: 2
Reputation: 308
The only way I found is to use an executor pool in which each worker reads one of the Parquet files.
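A minimal sketch of that idea, using only the JDK. The class name `ParallelFileReader`, the method `readAll`, and the `readOneFile` parameter are hypothetical names for illustration. `readOneFile` stands in for whatever sequential reader you already have (for example a loop around AvroParquetReader), so the Parquet-specific part is deliberately left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFileReader {

    // Submits one task per file to a fixed-size pool and returns the
    // per-file results in the same order as the input paths.
    // readOneFile is an assumed, caller-supplied function that reads a
    // single file sequentially and returns its records.
    public static <T> List<List<T>> readAll(List<String> paths,
                                            Function<String, List<T>> readOneFile,
                                            int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (String path : paths) {
                // Each task reads exactly one file on a pool worker.
                futures.add(pool.submit(() -> readOneFile.apply(path)));
            }
            List<List<T>> results = new ArrayList<>();
            for (Future<List<T>> future : futures) {
                // get() blocks until that file is done and rethrows read failures.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

This parallelizes across files, not within a single file, so it only helps when the data is split into several Parquet files. A call might look like `ParallelFileReader.readAll(paths, p -> mySequentialRead(p), 4)`, where `mySequentialRead` is your existing reader.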
Upvotes: 1