S3 Implementation for org.apache.parquet.io.InputFile?

Question

I am trying to write a Scala-based AWS Lambda to read Snappy compressed Parquet files based in S3. The process will write them backout in partitioned JSON files.

I have been trying to use the org.apache.parquet.hadoop.ParquetFileReader class to read the files... the non-deprecated way to do this appears to pass it a implementation of the org.apache.parquet.io.InputFile interface. There is one for Hadoop (HadoopInputFile)... but I cannot find one for S3. I also tried some of the deprecated ways for this class, but could not get them to work with S3 either.

Any solution to this dilemma?

Just in case anyone is interested... why I am doing this in Scala? Well... I cannot figure out another way to do it. The Python implementations for Parquet (pyarrow and fastparquet) both seem to struggle with complicated list/struct based schemas.

Also, I have seen some AvroParquetReader based code (Read parquet data from AWS s3 bucket) that might be a different solution, but I could not get these to work without a known schema. but maybe I am missing something there.

I'd really like to get the ParquetFileReader class to work, as it seem clean.

Appreciate any ideas.

S3 Implementation for org.apache.parquet.io.InputFile?

Answers (1)

Related Questions