iuliandumitru

Reputation: 171

Spark - write Avro file

What are the common practices for writing Avro files with Spark (using the Scala API) in a flow like this:

  1. parse some logs files from HDFS
  2. for each log file, apply some business logic and generate an Avro file (or maybe merge multiple files)
  3. write Avro files to HDFS

I tried to use spark-avro, but it doesn't help much.

import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._ // adds the .avro method on DataFrameWriter

val someLogs = sc.textFile(inputPath)

// Turn each log line into a Row matching `schema`
val rowRDD = someLogs.map { line =>
  createRow(...)
}

val sqlContext = new SQLContext(sc)
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
dataFrame.write.avro(outputPath)

This fails with the following error:

org.apache.spark.sql.AnalysisException: 
      Reference 'StringField' is ambiguous, could be: StringField#0, StringField#1, StringField#2, StringField#3, ...
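
The StringField#0, StringField#1, … suffixes in the message suggest that the schema passed to createDataFrame declares several columns that all share the name StringField, so any reference to that column is ambiguous. Giving every field a distinct name avoids the problem; a minimal sketch (the field names below are hypothetical):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Each field gets a unique name so column references resolve unambiguously
val schema = StructType(Seq(
  StructField("timestamp", StringType, nullable = true),
  StructField("level", StringType, nullable = true),
  StructField("message", StringType, nullable = true)
))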

Upvotes: 8

Views: 38244

Answers (3)

Spidey Praful

Reputation: 59

You need to start the Spark shell with the Avro package included (recommended for lower Spark versions):

$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-avro_2.11:4.0.0

Then use the DataFrame to write an Avro file:

dataframe.write.format("com.databricks.spark.avro").save(outputPath)

And to write it as an Avro table in Hive:

dataframe.write.format("com.databricks.spark.avro").saveAsTable("hivedb.hivetable_avro")
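
As an aside, from Spark 2.4 onwards Avro support ships with Spark itself as an external module, so the Databricks package is only needed on older versions. A sketch of the built-in equivalent (assuming Spark 2.4+ started with --packages org.apache.spark:spark-avro_2.11:2.4.0):

// Built-in Avro source in Spark 2.4+; the short format name is "avro"
dataframe.write.format("avro").save(outputPath)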

Upvotes: 2

Debaditya

Reputation: 2497

Spark 2 and Scala 2.11

import com.databricks.spark.avro._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").getOrCreate()

// Do all your operations, ending up with a DataFrame, say dataFrame

dataFrame.write.avro("/tmp/output")
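
Reading the files back works through the same implicit import; a small sketch:

// Read the Avro files back into a DataFrame
val readBack = spark.read.avro("/tmp/output")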

Maven dependency

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-avro_2.11</artifactId>
    <version>4.0.0</version> 
</dependency>
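
If the project builds with sbt rather than Maven, the equivalent dependency would be (assuming scalaVersion is set to 2.11.x, so %% resolves to spark-avro_2.11):

libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"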

Upvotes: 3

Sudheer Palyam

Reputation: 2519

Databricks provides the spark-avro library, which helps us read and write Avro data.

dataframe.write.format("com.databricks.spark.avro").save(outputPath)
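
Reading uses the same format string, so the matching read call looks like this (assuming a SparkSession named spark):

val df = spark.read.format("com.databricks.spark.avro").load(outputPath)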

Upvotes: 15
