Grbinho

Reputation: 408

AWS Glue ETL Job fails with AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

I'm trying to create an AWS Glue ETL Job that would load data from Parquet files stored in S3 into a Redshift table. The Parquet files were written using pandas with the 'simple' file scheme option into multiple folders in an S3 bucket. The layout looks like this:

s3://bucket/parquet_table/01/file_1.parquet

s3://bucket/parquet_table/01/file_2.parquet

s3://bucket/parquet_table/01/file_3.parquet

s3://bucket/parquet_table/02/file_1.parquet

s3://bucket/parquet_table/02/file_2.parquet

s3://bucket/parquet_table/02/file_3.parquet
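For illustration, a minimal sketch of how Parquet files with the 'simple' file scheme might be written from pandas via the fastparquet engine; the DataFrame and local paths here are made up, and the files would then be uploaded to the S3 layout above:

    import os
    import pandas as pd

    # Stand-in DataFrame; the real data is whatever pandas was holding.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # The 'simple' file scheme (a fastparquet option) writes one self-contained
    # Parquet file per call; the 01/, 02/ folders exist only because the paths
    # say so, not because of any partitioning option.
    os.makedirs("parquet_table/01", exist_ok=True)
    df.to_parquet("parquet_table/01/file_1.parquet",
                  engine="fastparquet", file_scheme="simple")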

I can use an AWS Glue Crawler to create a table in the AWS Glue Data Catalog, and that table can be queried from Athena, but it does not work when I try to create an ETL Job that would copy the same table to Redshift.
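For context, a Glue ETL job that copies a catalog table to Redshift typically looks roughly like the following (a simplified sketch along the lines of the script Glue generates; database, table and connection names are placeholders):

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the table the crawler created in the Glue Data Catalog
    source = glueContext.create_dynamic_frame.from_catalog(
        database="my_database",          # placeholder
        table_name="parquet_table")      # placeholder

    # Write it to Redshift through a catalog JDBC connection
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=source,
        catalog_connection="my-redshift-connection",   # placeholder
        connection_options={"dbtable": "parquet_table", "database": "dev"},
        redshift_tmp_dir=args["TempDir"])

    job.commit()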

If I crawl a single file, or if I crawl multiple files in one folder, it works. As soon as there are multiple folders involved, I get the above-mentioned error:

AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'

Similar issues appear if, instead of the 'simple' scheme, I use 'hive'. Then we have multiple folders and also empty Parquet files that throw:

java.io.IOException: Could not read footer: java.lang.RuntimeException: xxx is not a Parquet file (too small)

Is there some recommendation on how to read Parquet files and structure them in S3 when using AWS Glue (ETL and Data Catalog)?

Upvotes: 3

Views: 3877

Answers (2)

Karan Hebbar

Reputation: 176

The error you're facing occurs because, when reading Parquet files from S3, Spark/Glue expects the data to be in Hive-style partitions, i.e. the partition folder names should be key=value pairs. You'll have to lay out the S3 hierarchy in Hive-style partitions, something like below:

s3://your-bucket/parquet_table/id=1/file1.parquet

s3://your-bucket/parquet_table/id=2/file2.parquet

and so on.

Then use the path below to read all the files in the bucket:

location : s3://your-bucket/parquet_table

If the data in S3 is partitioned the above way, you won't face any issues.
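For example, reading that layout from Spark (or from the Spark session inside a Glue job) would look roughly like this; the bucket name and the id partition column are just the placeholders from above:

    from pyspark.sql import SparkSession

    # Inside a Glue job this would be glueContext.spark_session
    spark = SparkSession.builder.getOrCreate()

    # Pointing the reader at the table's base path picks up every
    # key=value folder underneath it, and the partition key ("id")
    # appears as a column in the resulting DataFrame.
    df = spark.read.parquet("s3://your-bucket/parquet_table")
    df.printSchema()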

Upvotes: 0

Zeitgeist

Reputation: 1520

Redshift doesn't support the Parquet format. Redshift Spectrum does. Athena also supports Parquet.

Upvotes: 0
