Yrah

Reputation: 113

Kafka topic data to HDFS parquet file using HDFS sink connector configuration issue

I need help regarding a Kafka topic that I would like to put into HDFS in Parquet format (with a daily partitioner).

I have a lot of data in a Kafka topic, which is basically JSON data like this:

{"title":"Die Hard","year":1988,"cast":["Bruce Willis","Alan Rickman","Bonnie Bedelia","William Atherton","Paul Gleason","Reginald VelJohnson","Alexander Godunov"],"genres":["Action"]}
{"title":"Toy Story","year":1995,"cast":["Tim Allen","Tom Hanks","(voices)"],"genres":["Animated"]}
{"title":"Jurassic Park","year":1993,"cast":["Sam Neill","Laura Dern","Jeff Goldblum","Richard Attenborough"],"genres":["Adventure"]}
{"title":"The Lord of the Rings: The Fellowship of the Ring","year":2001,"cast":["Elijah Wood","Ian McKellen","Liv Tyler","Sean Astin","Viggo Mortensen","Orlando Bloom","Sean Bean","Hugo Weaving","Ian Holm"],"genres":["Fantasy »]}
{"title":"The Matrix","year":1999,"cast":["Keanu Reeves","Laurence Fishburne","Carrie-Anne Moss","Hugo Weaving","Joe Pantoliano"],"genres":["Science Fiction"]}

This topic's name is: test

I would like to put this data into my HDFS cluster in Parquet format, but I'm struggling with the sink connector configuration. I use the Confluent hdfs-sink-connector for that.

Here is what I have managed to do so far:

{
  "name": "hdfs-sink",
  "config": {
    "name": "hdfs-sink",
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "test",
    "hdfs.url": "hdfs://hdfs-IP:8020",
    "hadoop.home": "/user/test-user/TEST",
    "flush.size": "3",
    "locale": "fr-fr",
    "timezone": "UTC",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.DailyPartitioner",
    "consumer.auto.offset.reset": "earliest",
    "value.converter":  "org.apache.kafka.connect.json.JsonConverter",
    "key.converter": "org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable": "true",
    "value.converter.schemas.enable": "true"
    
  }
}

Some explanation of why I configured the connector like that:

I understood that I might have to use the Schema Registry to format the data into Parquet, but I don't know how to do that. Is it even necessary?

Can you please help me with that?

Thank you

Upvotes: 0

Views: 2456

Answers (1)

OneCricketeer

Reputation: 191973

I have not personally used the ParquetFormat, but your data must have a schema, which means one of the following:

  1. Your data is produced using the Confluent Avro serializer
  2. Your data is produced as Protobuf and you add the Protobuf converter to your Connect workers
  3. You use Kafka Connect's special JSON format that includes a schema within each record (see the example below)
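
For illustration, option 3 means each record value carries its own schema alongside the payload. A minimal sketch, covering only the title and year fields (the "movie" name is arbitrary, and your array fields would need additional array-typed entries in the schema):

{
  "schema": {
    "type": "struct",
    "name": "movie",
    "fields": [
      { "field": "title", "type": "string", "optional": false },
      { "field": "year", "type": "int32", "optional": false }
    ]
  },
  "payload": {
    "title": "Die Hard",
    "year": 1988
  }
}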

Basically, it cannot be "plain JSON". In other words, you currently have "value.converter.schemas.enable": "true", and I'm guessing your connector isn't working because your records are not in one of the above formats.

Without a schema, the JSON parser cannot possibly know what "columns" Parquet needs to write.
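
For example, if you went with option 1 and re-produced the topic with the Confluent Avro serializer, the converter section of your connector config would look roughly like this (http://schema-registry:8081 is a placeholder for your own Schema Registry URL, and the key converter shown is just an assumption):

"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://schema-registry:8081",
"key.converter": "org.apache.kafka.connect.storage.StringConverter"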


Also, the DailyPartitioner does not create one file per day, only a directory. You will get one file per flush.size, and there is also a configuration for scheduled rotation intervals for flushing files. In addition, there will be one file per Kafka partition.
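
As a rough sketch, file rotation in the HDFS sink is driven by settings like these (flush.size counts records, the rotate settings are in milliseconds, and the values here are only examples, not recommendations):

"flush.size": "1000",
"rotate.interval.ms": "600000",
"rotate.schedule.interval.ms": "600000"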


Also, "consumer.auto.offset.reset": "earliest", only works in the connect-distribtued.properties file, not on a per-connector bases, AFAIK.


Since I haven't personally used the ParquetFormat, that's all the advice I can give, but I have used other tools, such as NiFi, for similar goals, which would allow you to keep your existing Kafka producer code unchanged.


Alternatively, use JsonFormat instead; however, Hive integration will not work automatically, and the tables must be pre-defined (which requires you to have a schema for your topic anyway).
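
If you try that, it should only be a one-line change in your connector config (assuming the JSON format class that ships with the HDFS connector):

"format.class": "io.confluent.connect.hdfs.json.JsonFormat"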


Another option is to just configure Hive to read from Kafka directly.

Upvotes: 1
