Reputation: 1408
Correct me if I'm wrong: a Parquet file is self-describing, meaning it contains its own schema.
I would like to use the Confluent S3 sink connector (especially because it correctly handles exactly-once semantics with S3) to read JSON records from our Kafka cluster and then create Parquet files in S3, partitioned by event time. Our JSON records don't have a schema embedded.
I know it's not supported yet, but I have a few questions regarding Parquet and Avro as well.
Since there is no schema embedded in our JSON records, would the connector task have to infer the schema from the JSON fields itself? (Is that a doable solution?)
There is no such thing as a schema registry for Parquet in Kafka, is that right?
Avro seems well integrated with Kafka, meaning the schema is read from the Schema Registry. Does that mean the Confluent S3 sink will be smart enough to create files in S3 containing the schema as a header followed by a bunch of records?
I know someone was working on a Parquet implementation for this S3 sink connector:
https://github.com/confluentinc/kafka-connect-storage-cloud/pull/172
But I don't understand: it seems to use an Avro schema in the code. Does this imply having Avro records in Kafka in order to use this Parquet implementation?
I'm starting to think that it would be easier to target Avro files on S3 (I can afford losing some OLAP capabilities), but I wanted to be sure before going with Avro.
Upvotes: 4
Views: 2715
Reputation: 191743
Correct me if I'm wrong: a Parquet file is self-describing, meaning it contains its own schema
Correct. If you have a Parquet file, you can get the schema from it.
How do I get schema / column names from parquet file?
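For example, a minimal sketch (assuming the parquet-hadoop library is on the classpath; the file path here is just a placeholder) that prints the schema stored in a Parquet file's footer:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetSchema {
    public static void main(String[] args) throws Exception {
        // Only the footer metadata is read; no row data is deserialized.
        Path file = new Path(args.length > 0 ? args[0] : "part-00000.parquet");
        try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            System.out.println(schema);
        }
    }
}
```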
create files in S3 containing the schema as a header followed by a bunch of records?
Yes, that's exactly how the S3 Connector works for Avro files.
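As a rough sketch of that setup (topic, bucket, region, and Schema Registry URL below are placeholders), a sink config along these lines would write time-partitioned Avro files, each with the Avro schema embedded:

```json
{
  "name": "s3-avro-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "events",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
    "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
    "partition.duration.ms": "3600000",
    "timestamp.extractor": "Record",
    "locale": "en-US",
    "timezone": "UTC",
    "flush.size": "1000"
  }
}
```

Here timestamp.extractor=Record partitions by the record's Kafka timestamp; RecordField plus a timestamp.field setting can be used instead to partition by a field inside the message (event time).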
it seems to use an Avro schema in the code. Does this imply having Avro records in Kafka in order to use this Parquet implementation?
I've not looked too extensively at the PR, but I think the Parquet storage format only requires a Connect Schema, not Avro data, because the AvroData class can translate back and forth between Connect Schemas and Avro schemas, e.g. avroData.fromConnectSchema(schema). This parses the Connect Schema structure and builds a new Avro schema from it; it doesn't talk to the Registry or require the input data to be Avro.
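To illustrate that translation, a small sketch (the Event schema below is made up; AvroData comes from Confluent's Avro converter artifact):

```java
import io.confluent.connect.avro.AvroData;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

public class ConnectToAvroSchemaDemo {
    public static void main(String[] args) {
        // A hand-built Connect schema; no Registry and no Avro-encoded records involved.
        Schema connectSchema = SchemaBuilder.struct()
                .name("Event")                          // made-up record name
                .field("id", Schema.STRING_SCHEMA)
                .field("ts", Schema.INT64_SCHEMA)
                .build();

        // AvroData translates between Connect and Avro schemas (arg = schema cache size).
        AvroData avroData = new AvroData(100);
        org.apache.avro.Schema avroSchema = avroData.fromConnectSchema(connectSchema);
        System.out.println(avroSchema.toString(true));
    }
}
```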
That being said, if your JSON objects did have a schema, then it might be possible to write them with formats other than JSONFormat, because the format.class setting gets applied after the Converter. Anecdotally, I was able to write Avro input records out as JSON files with AvroConverter + JSONFormat, but I've not tried JSONConverter + schema'd JSON with AvroFormat.
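For reference, "schema'd JSON" here means the envelope JSONConverter expects when schemas.enable=true, roughly like this (field names are made up):

```json
{
  "schema": {
    "type": "struct",
    "name": "Event",
    "optional": false,
    "fields": [
      { "field": "id", "type": "string", "optional": false },
      { "field": "ts", "type": "int64", "optional": false }
    ]
  },
  "payload": {
    "id": "abc-123",
    "ts": 1554300000000
  }
}
```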
Update (after reading the docs): you must use ProtobufConverter or JSONSchemaConverter to get Parquet output. JSONConverter (with or without schemas) will not work.
I'm starting to think that it would be easier to target Avro files on S3
Probably... Note that you could use Secor instead, which has Hive table integration and claims to support Parquet output for JSON.
Upvotes: 3