Reputation: 83
I am building an application in which I query a SQL Server table using PySpark. My plan is to push this data to Kafka, where it will be consumed by the Google Cloud Storage Sink Connector and saved to Google Cloud Storage in Avro format for further processing. I'm doing this because the data sync application we've built requires Schema Registry for some of its automations, so I cannot push a PySpark Avro file directly to GCS.
I have been able to push the data to a Kafka topic using PySpark, but I cannot find any straightforward way to convert a PySpark DataFrame schema to an Avro schema to be stored in the Schema Registry.
I spent the past 2 hours searching, and there doesn't seem to be any library for this. Just code that manually maintains a mapping between Spark data types and Avro data types. This link has an example of such code.
But as the link itself says, there could be data types it is missing. So my question is: is there no better way of doing this than maintaining a mapping ourselves? Ideally I would prefer to use a well-maintained library for this.
Please let me know your thoughts.
Upvotes: 0
Views: 722
Reputation: 191743
convert pyspark dataframe schema to avro schema,
See the Javadoc for SchemaConverters.toAvroType:
https://spark.apache.org/docs/3.3.1/api/java/org/apache/spark/sql/avro/SchemaConverters.html
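A minimal sketch of calling that from PySpark through the JVM gateway; the spark-avro version, the sample DataFrame, and the record name/namespace are my own placeholders, not anything from your setup:

```python
# Sketch: derive an Avro schema from a DataFrame schema using Spark's own SchemaConverters.
# Assumes spark-avro is on the classpath (version picked to match the 3.3.1 docs linked above).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.3.1")
         .getOrCreate())

df = spark.createDataFrame([(1, "alice")], ["id", "name"])  # stand-in for the SQL Server query result

# SchemaConverters is a Scala/Java API, so reach it through the py4j gateway.
# toAvroType(catalystType, nullable, recordName, nameSpace) returns an org.apache.avro.Schema.
converters = spark._jvm.org.apache.spark.sql.avro.SchemaConverters
avro_schema = converters.toAvroType(
    df._jdf.schema(),  # the underlying Java StructType
    False,             # nullable
    "MyRecord",        # record name (placeholder)
    "com.example"      # namespace (placeholder)
)
avro_schema_str = avro_schema.toString(True)  # pretty-printed Avro JSON schema
print(avro_schema_str)
```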
to be stored in schema registry.
That's done automatically when you use the KafkaAvroSerializer in the producer, e.g. via a serialization UDF. Do not use Spark's built-in to_avro function for this, since its output lacks the Schema Registry wire-format header and will break any consumer that depends on the registry during deserialization.
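Since you're in PySpark, one way to do that is to swap the Java KafkaAvroSerializer for the Python confluent-kafka AvroSerializer inside a UDF; it emits the same Confluent wire format and registers the schema automatically. A minimal sketch, where the topic, registry URL, bootstrap servers, and schema are placeholders:

```python
# Sketch: produce Confluent-framed Avro (magic byte + schema id + payload) from Spark via a UDF.
# Assumes confluent-kafka is installed on the executors and spark-sql-kafka-0-10 is on the classpath.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BinaryType
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

TOPIC = "my_topic"                      # placeholder
REGISTRY_URL = "http://localhost:8081"  # placeholder
SCHEMA_STR = """{"type": "record", "name": "MyRecord", "namespace": "com.example",
  "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}"""
# e.g. the string produced by SchemaConverters.toAvroType above

def to_confluent_avro(row):
    # Lazy per-executor init so nothing unpicklable is captured by the UDF closure.
    # AvroSerializer registers SCHEMA_STR under "<topic>-value" on first use.
    global _SERIALIZER, _CTX
    if "_SERIALIZER" not in globals():
        client = SchemaRegistryClient({"url": REGISTRY_URL})
        _SERIALIZER = AvroSerializer(client, SCHEMA_STR)
        _CTX = SerializationContext(TOPIC, MessageField.VALUE)
    return _SERIALIZER(row.asDict(recursive=True), _CTX)

value_udf = F.udf(to_confluent_avro, BinaryType())

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "alice")], ["id", "name"])  # stand-in for the SQL Server query result

(df.select(value_udf(F.struct(*df.columns)).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder
   .option("topic", TOPIC)
   .save())
```

The lazy initialization inside the UDF avoids pickling the registry client to the executors, and the first call on each executor handles the schema registration for you.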
spent the past 2 hours searching
You could narrow that down by searching the Spark source code itself for imports of the Avro Schema class or package; however, none of that would help with the registry interaction. ABRiS is the only library I know of that does.
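A rough sketch of driving ABRiS from PySpark through the JVM gateway follows; I'm writing the AbrisConfig builder chain from memory of the ABRiS README, so treat every method name here as an assumption to verify against the ABRiS version you pin:

```python
# Rough sketch only: the AbrisConfig builder chain is an assumption based on the ABRiS README
# and may differ between ABRiS versions; verify it before relying on this.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.column import Column, _to_java_column

spark = SparkSession.builder.getOrCreate()  # assumes the ABRiS jar is already on the classpath
jvm = spark._jvm

df = spark.createDataFrame([(1, "alice")], ["id", "name"])  # stand-in for the SQL Server query result

# Avro schema to register, e.g. the string produced by SchemaConverters.toAvroType above.
schema_str = """{"type": "record", "name": "MyRecord", "namespace": "com.example",
  "fields": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]}"""

# Writer config: register schema_str under "<topic>-value" and serialize in the Confluent format.
abris_config = (jvm.za.co.absa.abris.config.AbrisConfig
                .toConfluentAvro()
                .provideAndRegisterSchema(schema_str)
                .usingTopicNameStrategy("my_topic", False)      # False = value schema (assumption)
                .usingSchemaRegistry("http://localhost:8081"))  # placeholder registry URL

def abris_to_avro(col, config):
    # Wrap ABRiS's Scala to_avro function so it can be used as a PySpark Column.
    return Column(jvm.za.co.absa.abris.avro.functions.to_avro(_to_java_column(col), config))

avro_df = df.select(abris_to_avro(F.struct(*df.columns), abris_config).alias("value"))
# avro_df can then be written with .write.format("kafka") as usual.
```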
Upvotes: 1