Reputation: 171
What are the common practices for writing Avro files with Spark (using the Scala API) in a flow like the one below? I tried to use spark-avro, but it doesn't help much.
val someLogs = sc.textFile(inputPath)
val rowRDD = someLogs.map { line =>
createRow(...)
}
val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)
This fails with error:
org.apache.spark.sql.AnalysisException:
Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
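This AnalysisException means the schema passed to createDataFrame contains several fields with the same name ("StringField"), so any reference to that name is ambiguous. One fix is to make the field names unique before building the schema. A minimal sketch in plain Scala (no Spark required), where `rawNames` is a hypothetical stand-in for the column names parsed from the logs:

```scala
// Suffix repeated names with a counter so every column name is unique.
val rawNames = Seq("StringField", "StringField", "IntField", "StringField")
val counts = scala.collection.mutable.Map.empty[String, Int]
val uniqueNames = rawNames.map { n =>
  val i = counts.getOrElse(n, 0)
  counts(n) = i + 1
  if (i == 0) n else s"$n$i"
}
// uniqueNames: Seq("StringField", "StringField1", "IntField", "StringField2")
```

With unique names in the StructType, the write no longer trips over ambiguous references.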
Upvotes: 8
Views: 38244
Reputation: 59
You need to start the Spark shell with the Avro package included (recommended for lower Spark versions):
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0
Then use the DataFrame to write an Avro file:
dataframe.write.format("com.databricks.spark.avro").save(outputPath)
And to write it as an Avro table in Hive (saveAsTable takes the table name as a String):
dataframe.write.format("com.databricks.spark.avro").saveAsTable("hivedb.hivetable_avro")
Upvotes: 2
Reputation: 2497
Spark 2 and Scala 2.11
import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().master("local").getOrCreate()
// Do all your operations and save it on your Dataframe say (dataFrame)
dataFrame.write.avro("/tmp/output")
Maven dependency
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>4.0.0</version>
</dependency>
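If the project uses sbt instead of Maven, the equivalent coordinate (same assumption of Scala 2.11 and spark-avro 4.0.0) is:

```scala
// build.sbt — %% appends the Scala binary version (_2.11) automatically
libraryDependencies += "com.databricks" % "spark-avro_2.11" % "4.0.0"
```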
Upvotes: 3
Reputation: 2519
Databricks provides the spark-avro library, which helps us read and write Avro data.
dataframe.write.format("com.databricks.spark.avro").save(outputPath)
Upvotes: 15
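Putting the pieces together, here is a minimal end-to-end sketch, assuming spark-avro 4.x is on the classpath and local mode is acceptable; the output path is illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("avro-demo")
  .getOrCreate()
import spark.implicits._

// Build a small DataFrame with distinct column names.
val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")

// Write it out as Avro files.
df.write.format("com.databricks.spark.avro").save("/tmp/avro-out")

// Read the files back to verify the round trip.
val back = spark.read.format("com.databricks.spark.avro").load("/tmp/avro-out")
back.show()
```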