alon

Reputation: 91

ParquetLoader: can't load multiple Parquet files using Pig

I'm getting the following error when loading several Parquet files: "Error during parsing. repetition constraint is more restrictive: can not merge type required binary MyTime into optional binary MyTime". Maybe one of the files is corrupted, but I don't know how to skip it.

Thanks

Upvotes: 0

Views: 1355

Answers (1)

Neil Best

Reputation: 823

This happens when reading multiple Parquet files whose schemas differ slightly in their metadata, for example a column that is required in one file and optional in another, as in your error message. Either you have a mixed collection of files in a single directory, or you are giving the LOAD statement a glob and the resulting collection of files is mixed in this respect.

Rather than specifying the schema in an AS() clause or making a bare call to the loader function, the solution is to override the schema in the loader function's argument, like this:

data = LOAD 'data'
    USING parquet.pig.ParquetLoader('n1:int, n2:float, n3:double, n4:long');

Otherwise, the loader function infers the schema from the first file it encounters, which then conflicts with one of the others.

If you still have trouble, try using type bytearray in the schema specification and then cast to the desired types in a subsequent FOREACH.
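
A rough sketch of that fallback, reusing the path and the column names n1 to n4 from the example above (they are placeholders, not your real schema). I have not run this against mixed files, so treat it as a starting point rather than a verified fix:

```pig
-- Placeholder path and column names, carried over from the example above.
-- Load every column as bytearray so the loader is not asked for specific types.
raw = LOAD 'data'
    USING parquet.pig.ParquetLoader('n1:bytearray, n2:bytearray, n3:bytearray, n4:bytearray');

-- Cast each column to the type you actually want in a separate step.
typed = FOREACH raw GENERATE
    (int) n1 AS n1,
    (float) n2 AS n2,
    (double) n3 AS n3,
    (long) n4 AS n4;
```

Running DESCRIBE typed; afterwards should show whether the casts produced the schema you expect.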

According to the Parquet source code, the loader function takes another argument that allows columns to be specified by position rather than by name (the default), but I have not experimented with that.

Upvotes: 2
