Wanderer

Reputation: 1647

S3 Implementation for org.apache.parquet.io.InputFile?

I am trying to write a Scala-based AWS Lambda to read Snappy-compressed Parquet files stored in S3. The process will write them back out as partitioned JSON files.

I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to be to pass it an implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.

Any solution to this dilemma?

Just in case anyone is interested... why am I doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct-based schemas.

Also, I have seen some AvroParquetReader-based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get it to work without a known schema... maybe I am missing something there.
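For reference, this is roughly the pattern I was attempting (the reader is supposed to infer the Avro schema from the file footer, so none should need to be supplied up front; the path is a placeholder):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// placeholder path; schema is supposed to come from the file footer
try (ParquetReader<GenericRecord> reader = AvroParquetReader
        .<GenericRecord>builder(new Path("s3a://bucketname/path"))
        .build()) {
    GenericRecord record;
    while ((record = reader.read()) != null) {
        System.out.println(record);
    }
}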

I'd really like to get the ParquetFileReader class to work, as it seems clean.

Appreciate any ideas.

Upvotes: 2

Views: 3266

Answers (1)

Jörn Horstmann

Reputation: 34014

Hadoop uses its own filesystem abstraction layer, which has an implementation for S3 (https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A).

The setup should look something like the following (Java, but the same should work with Scala):

import java.net.URI;

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.s3a.Constants;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// configure the s3a filesystem
Configuration conf = new Configuration();
conf.set(Constants.ENDPOINT, "https://s3.eu-central-1.amazonaws.com/");
conf.set(Constants.AWS_CREDENTIALS_PROVIDER,
    DefaultAWSCredentialsProviderChain.class.getName());
// maybe additional configuration properties depending on the credential provider

URI uri = URI.create("s3a://bucketname/path");
Path path = new Path(uri);

ParquetFileReader pfr = ParquetFileReader.open(HadoopInputFile.fromPath(path, conf));
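Once the reader is open you can get the schema from the file footer and iterate over the row groups, without knowing the schema up front. A rough sketch (untested) using parquet's example Group API, continuing from the pfr above:

import org.apache.parquet.column.page.PageReadStore;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter;
import org.apache.parquet.io.ColumnIOFactory;
import org.apache.parquet.io.MessageColumnIO;
import org.apache.parquet.io.RecordReader;
import org.apache.parquet.schema.MessageType;

// the schema comes from the file footer, no need to know it in advance
MessageType schema = pfr.getFooter().getFileMetaData().getSchema();

PageReadStore rowGroup;
while ((rowGroup = pfr.readNextRowGroup()) != null) {
    MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
    RecordReader<Group> recordReader =
        columnIO.getRecordReader(rowGroup, new GroupRecordConverter(schema));
    long rows = rowGroup.getRowCount();
    for (long i = 0; i < rows; i++) {
        Group record = recordReader.read();
        // convert the record to JSON here; record.toString() shows the fields
    }
}
pfr.close();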

Upvotes: 2
