Reputation: 1408
Correct me if I'm wrong: a Parquet file is self-describing, meaning it contains its own schema.
I would like to use the Confluent S3 sink connector (especially because it correctly handles exactly-once semantics with S3) to read JSON records from our Kafka cluster and then create Parquet files in S3, partitioned by event time. Our JSON records don't have a schema embedded.
I know it's not supported yet, but I have a few questions regarding Parquet and Avro as well.
Since there is no schema embedded in our JSON records, would the connector task have to infer the schema from the JSON fields itself? (Is that a doable solution?)
There is no such thing as a schema registry for Parquet in Kafka, is that right?
Avro seems well integrated with Kafka, meaning the schema is read from the Schema Registry. Does that mean the Confluent S3 sink will be smart enough to create files in S3 containing the schema as a header followed by a bunch of records?
I know someone was working on a Parquet implementation for this S3 sink connector:
https://github.com/confluentinc/kafka-connect-storage-cloud/pull/172
But I don't understand: it seems to use an Avro schema in the code. Does this imply having Avro records in Kafka in order to use this Parquet implementation?
I'm starting to think that it would be easier to target Avro files on S3 (I can afford losing some OLAP capabilities), but I wanted to be sure before going with Avro.
Upvotes: 4
Views: 2715
Reputation: 191743
Correct me if I'm wrong: a Parquet file is self-describing, meaning it contains its own schema
Correct. If you have a Parquet file, you can get the schema from it.
How do I get schema / column names from parquet file?
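For example, a minimal sketch (assuming the parquet-hadoop library is on the classpath; the file path here is just a placeholder) that prints the schema stored in a Parquet file's footer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        // Only the footer metadata is read; no row data is deserialized.
        Path file = new Path(args.length > 0 ? args[0] : "part-00000.parquet");
        try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);
        }
    }
}
```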
create files in S3 containing the schema as a header followed by a bunch of records?
Yes, that's exactly how the S3 Connector works for Avro files.
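As a rough sketch of that setup (topic, bucket, region, and Schema Registry URL below are placeholders), a sink config along these lines would write time-partitioned Avro files, each with the Avro schema embedded:

```json
{
  "name": "s3-avro-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "partition.duration.ms": "3600000",
    "timestamp.extractor": "Record",
    "locale": "en-US",
    "timezone": "UTC",
    "flush.size": "1000"
  }
}
```

Here timestamp.extractor=Record partitions by the record's Kafka timestamp; RecordField plus a timestamp.field setting can be used instead to partition by a field inside the message (event time).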
it seems to use an Avro schema in the code. Does this imply having Avro records in Kafka in order to use this Parquet implementation?
I've not looked too extensively at the PR, but I think the Parquet storage format only requires a Connect Schema, not Avro data, because the AvroData class can translate back and forth between Connect Schemas and Avro schemas, e.g. avroData.fromConnectSchema(schema). This parses the Connect Schema structure and builds a new Avro schema from it; it doesn't talk to the Registry or require the input data to be Avro.
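To illustrate that translation, a small sketch (the Event schema below is made up; AvroData comes from Confluent's Avro converter artifact):

```java
import io.confluent.connect.avro.AvroData;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

public class ConnectToAvroSchemaDemo {
    public static void main(String[] args) {
        // A hand-built Connect schema; no Registry and no Avro-encoded records involved.
        Schema connectSchema = SchemaBuilder.struct()
                .name("Event")                          // made-up record name
                .field("id", Schema.STRING_SCHEMA)
                .field("ts", Schema.INT64_SCHEMA)
                .build();

        // AvroData translates between Connect and Avro schemas (arg = schema cache size).
        AvroData avroData = new AvroData(100);
        org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(connectSchema);
        System.out.println(avroSchema.toString(true));
    }
}
```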
That being said, if your JSON objects did have a schema, then it might be possible to write them with formats other than JSONFormat, because the format.class setting gets applied after the Converter. Anecdotally, I was able to write Avro input records out as JSON files with AvroConverter + JSONFormat, but I've not tried JSONConverter + schema'd JSON with AvroFormat.
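For reference, "schema'd JSON" here means the envelope JSONConverter expects when schemas.enable=true, roughly like this (field names are made up):

```json
{
  "schema": {
    "type": "struct",
    "name": "Event",
    "optional": false,
    "fields": [
      { "field": "id", "type": "string", "optional": false },
      { "field": "ts", "type": "int64", "optional": false }
    ]
  },
  "payload": {
    "id": "abc-123",
    "ts": 1554300000000
  }
}
```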
Update (after reading the docs): you must use ProtobufConverter or JSONSchemaConverter to get Parquet output. JSONConverter (with or without schemas) will not work.
I'm starting to think that it would be easier to target Avro files on S3
Probably... Note that you could use Secor instead, which has Hive table integration and claims to support Parquet output for JSON.
Upvotes: 3