tuxmobil

Reputation: 308

How to read a Parquet file in parallel with Java code

Is it possible to read a Parquet file in parallel?

I'm using something similar to what is described in "how to read a parquet file, in a standalone java code?" (based on AvroParquetReader), but that reads the file sequentially, not in parallel.

Cheers!

Upvotes: 5

Views: 1016

Answers (2)

Ashley Deaner

Reputation: 21

I'm still new to Parquet files, but I found it faster to open the file as a Spark Dataset, collect it to a list, and process that list with a parallel stream:

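import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Read the Parquet file into a Spark Dataset, collect the rows to a list, then process them in parallel.
String path = "s3a://" + bucket + "/" + key;

// local[1] runs Spark on a single thread; local[*] would use all available cores for the read.
SparkSession spark = SparkSession.builder().master("local[1]").appName("example.com").getOrCreate();
Dataset<Row> ds = spark.read().parquet(path);
// collectAsList() pulls every row into driver memory; Class::method stands in for your own per-row handler.
ds.collectAsList().parallelStream().forEach(Class::method);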

Upvotes: 2

tuxmobil

Reputation: 308

The only way I found is to use an executor pool in which each worker reads one of the Parquet files.
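A minimal sketch of that idea, using only java.util.concurrent. The names ParallelFileReader, readAll and readOneFile are made up for illustration. The Parquet-specific part is left as a function you pass in, for example the sequential AvroParquetReader loop from the question. Note that this parallelizes across files, not within a single file.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: submits one pool task per file and gathers the results by path.
final class ParallelFileReader {

    private ParallelFileReader() {}

    // readOneFile is an assumption: plug in your own sequential reader for a single file.
    static <T> Map<String, List<T>> readAll(List<String> paths,
                                            Function<String, List<T>> readOneFile,
                                            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // One task per file; each worker reads its whole file sequentially.
            Map<String, Future<List<T>>> pending = new LinkedHashMap<>();
            for (String path : paths) {
                Callable<List<T>> task = () -> readOneFile.apply(path);
                pending.put(path, pool.submit(task));
            }
            // Wait for every task and keep the results in submission order.
            Map<String, List<T>> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<List<T>>> entry : pending.entrySet()) {
                results.put(entry.getKey(), entry.getValue().get());
            }
            return results;
        } catch (InterruptedException ex) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reading files", ex);
        } catch (ExecutionException ex) {
            throw new IllegalStateException("Reading a file failed", ex.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

You would call it as ParallelFileReader.readAll(paths, path -> myReader(path), 4), where myReader is your own method that reads one file and returns its records as a list. Since every file's records are held in memory until the call returns, this only suits inputs that fit in the heap.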

Upvotes: 1
