Reputation: 51
Can somebody share an example of reading Avro using Java in Spark?
I found Scala examples, but no luck with Java.
Here is the snippet, part of a larger program, that runs into compilation issues with the method ctx.newAPIHadoopFile:
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
Configuration hadoopConf = new Configuration();
JavaRDD<SampleAvro> lines = ctx.newAPIHadoopFile(path, AvroInputFormat.class, AvroKey.class, NullWritable.class, new Configuration());
Regards
Upvotes: 2
Views: 5584
Reputation: 142
Here, assuming K is your key type and V is your value type:
...
Job job = Job.getInstance();
job.setInputFormatClass(AvroKeyValueInputFormat.class);
FileInputFormat.addInputPaths(job, <inputPaths>);
AvroJob.setInputKeySchema(job, <keySchema>);
AvroJob.setInputValueSchema(job, <valueSchema>);

// Java class literals cannot carry type parameters, so the raw classes
// are passed and the generics are declared on the resulting pair RDD.
JavaPairRDD<AvroKey<K>, AvroValue<V>> avroRDD =
    sc.newAPIHadoopRDD(job.getConfiguration(),
        AvroKeyValueInputFormat.class,
        AvroKey.class,
        AvroValue.class);
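To make that concrete, here is a self-contained sketch of my own (not from the snippet above). It assumes the input files hold Avro key/value pairs and uses GenericRecord on both sides; the class name, app name, and input path are placeholders, and the avro-mapred (hadoop2) artifact must be on the classpath:

import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapred.AvroValue;
import org.apache.avro.mapreduce.AvroKeyValueInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

public class AvroReadSketch {  // placeholder class name
    public static void main(String[] args) throws Exception {
        // Local master only for testing; drop it when launching via spark-submit
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("avro-read").setMaster("local[*]"));

        Job job = Job.getInstance();
        FileInputFormat.addInputPaths(job, "/path/to/avro/input");  // placeholder path
        // Reader schemas could be set here via AvroJob.setInputKeySchema /
        // setInputValueSchema; without them the writer's schema in the file is used.

        // Java class literals are raw, hence the unchecked cast on the result.
        @SuppressWarnings("unchecked")
        JavaPairRDD<AvroKey<GenericRecord>, AvroValue<GenericRecord>> avroRDD =
            (JavaPairRDD<AvroKey<GenericRecord>, AvroValue<GenericRecord>>) (JavaPairRDD<?, ?>)
                sc.newAPIHadoopRDD(job.getConfiguration(),
                    AvroKeyValueInputFormat.class, AvroKey.class, AvroValue.class);

        // Unwrap the AvroKey containers to get at the records themselves
        JavaRDD<GenericRecord> records = avroRDD.map(
            new Function<Tuple2<AvroKey<GenericRecord>, AvroValue<GenericRecord>>, GenericRecord>() {
                @Override
                public GenericRecord call(
                        Tuple2<AvroKey<GenericRecord>, AvroValue<GenericRecord>> pair) {
                    return pair._1().datum();
                }
            });

        System.out.println("record count: " + records.count());
        sc.stop();
    }
}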
Upvotes: 1
Reputation: 2137
You can use the spark-avro connector library by Databricks.
The recommended way to read or write Avro data with Spark SQL is through the DataFrame API; the connector supports both directions:
import org.apache.spark.sql.*;

SQLContext sqlContext = new SQLContext(sc);

// Creates a DataFrame from a specified file
DataFrame df = sqlContext.read().format("com.databricks.spark.avro")
        .load("src/test/resources/episodes.avro");

// Saves the subset of the Avro records read in
df.filter("age > 5").write()
        .format("com.databricks.spark.avro")
        .save("/tmp/output");
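As a follow-up sketch (not from the library's documentation), the same DataFrame can also be queried with plain SQL once it is registered as a temporary table; the table name episodes and the age column are just illustrative:

// Register the Avro-backed DataFrame as a temp table and query it with SQL
df.registerTempTable("episodes");
DataFrame adults = sqlContext.sql("SELECT * FROM episodes WHERE age > 5");
adults.show();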
Note that this connector has different versions for Spark 1.2, 1.3, and 1.4+:
Spark version    connector version
1.2              0.2.0
1.3              1.0.0
1.4+             2.0.1
Using Maven:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.10</artifactId>
    <version>{AVRO_CONNECTOR_VERSION}</version>
</dependency>
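If you are just experimenting from the shell, the same artifact can instead be pulled in at launch time (Spark 1.3+), for example:

spark-shell --packages com.databricks:spark-avro_2.10:2.0.1

Pick the connector version matching your Spark release per the table above.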
See further info at: Spark SQL Avro Library
Upvotes: 2