Reputation: 7800
How can I read a subset of fields from an avro-parquet file in java?
I thought I could define an avro schema which is a subset of the stored records and then read them...but I get an exception.
here is how i tried to solve it
I have 2 avro schemas:
The fields of ClassB are a subset of ClassA.
final Builder<ClassB> builder = AvroParquetReader.builder(files[0].getPath());
final ParquetReader<ClassB> reader = builder.build();
//AvroParquetReader<ClassA> readerA = new AvroParquetReader<ClassA>(files[0].getPath());
ClassB record = null;
final List<ClassB> list = new ArrayList<>();
while ((record = reader.read()) != null) {
list.add(record);
}
But I get a ClassCastException
on line (record=reader.read())
: Cannot convert ClassA to ClassB
I suppose the reader is reading the schema from the file.
I tried to send in the model (i.e. builder.withModel
) but since classB extends org.apache.avro.specific.SpecificRecordBase
it throws an exception.
I event tried to set the schema in the configuration and set it through builder.withConfig
but no cigar...
Upvotes: 2
Views: 7562
Reputation: 7800
So...
Couple of things:
AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.$Schema)
can be used to set a projection for the columns that are selected.reader.readNext
method still will return a ClassA
object but will null out the fields that are not present in ClassB
.To use the reader directly you can do the following:
AvroReadSupport.setRequestedProjection(hadoopConf, ClassB.SCHEMA$);
final Builder<ClassB> builder = AvroParquetReader.builder(files[0].getPath());
final ParquetReader<ClassA> reader = builder.withConf(hadoopConf).build();
ClassA record = null;
final List<ClassA> list = new ArrayList<>();
while ((record = reader.read()) != null) {
list.add(record);
}
Also if you're planning to use an inputformat to read the avro-parquet file, there is a convenience method - here is a spark example:
final Job job = Job.getInstance(hadoopConf);
ParquetInputFormat.setInputPaths(job, pathGlob);
AvroParquetInputFormat.setRequestedProjection(job, ClassB.SCHEMA$);
@SuppressWarnings("unchecked")
final JavaPairRDD<Void, ClassA> rdd = sc.newAPIHadoopRDD(job.getConfiguration(), AvroParquetInputFormat.class,
Void.class, ClassA.class);
Upvotes: 3