ASHISH M.G

Reputation: 792

Avro schema (.avsc) enforcement in PySpark

Can anyone help me with reading an Avro schema (.avsc) in PySpark and enforcing it while writing the DataFrame to a target storage? All my target table schemas are provided as .avsc files, and I need to supply this custom schema while saving my DataFrame in PySpark. I know there are libraries like spark-avro from Databricks, but all the examples are given in Scala.

Upvotes: 0

Views: 3335

Answers (1)

Matt

Reputation: 650

With this file /tmp/test.avsc

{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "first", "type": "string" },
       { "name": "last", "type": "string" }
     ]
}

and a dataframe like this:

from pyspark.sql.types import StructType, StructField, StringType

df = spark.createDataFrame(
    [{"first": "john", "last": "parker"}],
    StructType([StructField("first", StringType()), StructField("last", StringType())]),
)

resulting in this:

+-----+------+
|first|  last|
+-----+------+
| john|parker|
+-----+------+

you can do this to enforce the schema on write:

with open("/tmp/test.avsc", "r") as f:
    jsonFormatSchema = f.read()

df.write.format("avro").options(avroSchema=jsonFormatSchema).save("/tmp/avro")

and similarly to enforce the schema on read:

spark.read.format('avro').options(avroSchema=jsonFormatSchema).load("/tmp/avro")
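Since an .avsc file is just JSON, you can also inspect it with the standard library before handing it to Spark, for example to check that your DataFrame columns match the declared fields. A minimal sketch (the schema string below is the same content as /tmp/test.avsc above, inlined so the snippet is self-contained):

```python
import json

# Same record schema as /tmp/test.avsc from the answer above
jsonFormatSchema = """
{
     "type": "record",
     "namespace": "com.example",
     "name": "FullName",
     "fields": [
       { "name": "first", "type": "string" },
       { "name": "last", "type": "string" }
     ]
}
"""

schema = json.loads(jsonFormatSchema)
field_names = [f["name"] for f in schema["fields"]]
print(field_names)  # ['first', 'last']

# Before writing, you could verify that df.columns == field_names
# and fail fast instead of relying on the Avro writer's error message.
```

Note that the built-in `avro` data source is an external module, so Spark must be launched with the spark-avro package, e.g. `--packages org.apache.spark:spark-avro_2.12:<your-spark-version>` (the exact artifact depends on your Spark and Scala versions).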

More information is available here; the page also includes plenty of Python examples: https://spark.apache.org/docs/latest/sql-data-sources-avro.html

Upvotes: 4
