Reputation: 364
I am trying to read na Avro
file in spark job.
My spark version is 1.6.0
(spark-core_2.10-1.6.0-cdh5.7.1).
Here is my java code:
JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadAvro"));
JavaPairRDD <NullWritable, Text> lines = sc.newAPIHadoopFile(args[0],AvroKeyValueInputFormat.class,AvroKey.class,AvroValue.class,new Configuration());
But I am getting a compile time exception:
The method newAPIHadoopFile(String, Class, Class, Class, Configuration) in the type JavaSparkContext is not applicable for the arguments (String, Class, Class, Class, Configuration)
So what is the correct way to use JavaSparkContext.newAPIHadoopFile()
in Java?
Upvotes: 0
Views: 1487
Reputation: 476
public class Utils {
public static <T> JavaPairRDD<String, T> loadAvroFile(JavaSparkContext sc, String avroPath) {
JavaPairRDD<AvroKey, NullWritable> records = sc.newAPIHadoopFile(avroPath, AvroKeyInputFormat.class, AvroKey.class, NullWritable.class, sc.hadoopConfiguration());
return records.keys()
.map(x -> (GenericRecord) x.datum())
.mapToPair(pair -> new Tuple2<>((String) pair.get("key"), (T)pair.get("value")));
}
}
Use the utility as:
JavaPairRDD<String, YourAvroClassName> records = Utils.<YourAvroClassName>loadAvroFile(sc, inputDir);
You might also need to use KryoSerializer and register your custom KryoRegistrator:
sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.set("spark.kryo.registrator", "com.test.avro.MyKryoRegistrator");
public class MyKryoRegistrator implements KryoRegistrator {
public static class SpecificInstanceCollectionSerializer<T extends Collection> extends CollectionSerializer {
Class<T> type;
public SpecificInstanceCollectionSerializer(Class<T> type) {
this.type = type;
}
@Override
protected Collection create(Kryo kryo, Input input, Class<Collection> type) {
return kryo.newInstance(this.type);
}
@Override
protected Collection createCopy(Kryo kryo, Collection original) {
return kryo.newInstance(this.type);
}
}
Logger logger = LoggerFactory.getLogger(this.getClass());
@Override
public void registerClasses(Kryo kryo) {
// Avro POJOs contain java.util.List which have GenericData.Array as their runtime type
// because Kryo is not able to serialize them properly, we use this serializer for them
kryo.register(GenericData.Array.class, new SpecificInstanceCollectionSerializer<>(ArrayList.class));
kryo.register(YourAvroClassName.class);
}
}
Hope this helps...
Upvotes: 3