yosi

Reputation: 639

avro events from kafka to HDFS with flume

I have a Kafka cluster that receives Avro events from producers.

I would like to use Flume to consume these events and write them to HDFS as Avro files.

Is this possible with Flume?

Does anyone have an example of a configuration file demonstrating how to do it?

Yosi

Upvotes: 3

Views: 2902

Answers (2)

Praveen L

Reputation: 987

Consider this scenario: the Avro events coming from Kafka contain only the binary data, without the schema. Below is the agent configuration that worked for me.

The schema is added on the sink side by this agent.

#source
MY_AGENT.sources.my-source.type = org.apache.flume.source.kafka.KafkaSource
MY_AGENT.sources.my-source.channels = my-channel
MY_AGENT.sources.my-source.batchSize = 10000
MY_AGENT.sources.my-source.useFlumeEventFormat = false
MY_AGENT.sources.my-source.batchDurationMillis = 5000
MY_AGENT.sources.my-source.kafka.bootstrap.servers = ${BOOTSTRAP_SERVERS}
MY_AGENT.sources.my-source.kafka.topics = my-topic
MY_AGENT.sources.my-source.kafka.consumer.group.id = my-topic_grp
MY_AGENT.sources.my-source.kafka.consumer.client.id = my-topic_clnt
MY_AGENT.sources.my-source.kafka.compressed.topics = my-topic
MY_AGENT.sources.my-source.kafka.auto.commit.enable = false
MY_AGENT.sources.my-source.kafka.consumer.session.timeout.ms=100000
MY_AGENT.sources.my-source.kafka.consumer.request.timeout.ms=120000
MY_AGENT.sources.my-source.kafka.consumer.max.partition.fetch.bytes=704857
MY_AGENT.sources.my-source.kafka.consumer.auto.offset.reset=latest

#channel
MY_AGENT.channels.my-channel.type = memory
MY_AGENT.channels.my-channel.capacity = 100000000
MY_AGENT.channels.my-channel.transactionCapacity = 100000
MY_AGENT.channels.my-channel.parseAsFlumeEvent = false

#Sink
MY_AGENT.sinks.my-sink.channel = my-channel
MY_AGENT.sinks.my-sink.type = hdfs
MY_AGENT.sinks.my-sink.hdfs.writeFormat = Text
MY_AGENT.sinks.my-sink.hdfs.fileType = DataStream
MY_AGENT.sinks.my-sink.hdfs.kerberosPrincipal = ${user}
MY_AGENT.sinks.my-sink.hdfs.kerberosKeytab = ${keytab}
MY_AGENT.sinks.my-sink.hdfs.useLocalTimeStamp = true
MY_AGENT.sinks.my-sink.hdfs.path = hdfs://nameservice1/my_hdfs/my_table1/timestamp=%Y%m%d
MY_AGENT.sinks.my-sink.hdfs.rollCount=0
MY_AGENT.sinks.my-sink.hdfs.rollSize=0
MY_AGENT.sinks.my-sink.hdfs.batchSize=100000
MY_AGENT.sinks.my-sink.hdfs.maxOpenFiles=2000
MY_AGENT.sinks.my-sink.hdfs.callTimeout=50000
MY_AGENT.sinks.my-sink.hdfs.fileSuffix=.avro

MY_AGENT.sinks.my-sink.serializer = org.apache.flume.sink.hdfs.AvroEventSerializer$Builder
MY_AGENT.sinks.my-sink.serializer.schemaURL = hdfs://nameservice1/my_hdfs/avro_schemas/${AVSC_FILE}

A few things I want to highlight:

MY_AGENT.sinks.my-sink.hdfs.writeFormat = Text .. this dumps only the data coming in the Flume event body (ignoring the Flume event headers).

MY_AGENT.sinks.my-sink.serializer.schemaURL = hdfs://nameservice1/my_hdfs/avro_schemas/${AVSC_FILE} .. you need to pass the appropriate schema here (it will be added to the binary data in the Avro file). The final output file in HDFS will contain schema + data.
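For illustration, the file that ${AVSC_FILE} points to is just an Avro schema in its standard JSON form. A hypothetical schema for the events (the record and field names here are made up, not from the question) might look like:

```json
{
  "type": "record",
  "name": "MyEvent",
  "namespace": "com.example",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "event_ts", "type": "long"},
    {"name": "payload",  "type": ["null", "string"], "default": null}
  ]
}
```

This must be the same schema the producers used to serialize the binary payloads, otherwise the sink will write files that cannot be decoded.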

After storing the data in HDFS, I created a Hive table with the appropriate Avro schema and was able to access the data as expected.
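As a sketch of that last step, the Hive table can point at the same schema file via avro.schema.url (the table name, partition column, and paths below are assumptions matching the sink config above):

```sql
CREATE EXTERNAL TABLE my_table1
PARTITIONED BY (`timestamp` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs://nameservice1/my_hdfs/my_table1'
TBLPROPERTIES ('avro.schema.url'='hdfs://nameservice1/my_hdfs/avro_schemas/my_table1.avsc');
```

You would still need to register the daily partitions (e.g. with MSCK REPAIR TABLE or ALTER TABLE ... ADD PARTITION) before the data shows up in queries.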

Upvotes: 0

Phillip Mann

Reputation: 937

This is indeed possible.

If you wish to consume from Kafka, then you need to set up a Kafka source and an HDFS sink that will use Avro.

Here is the link to the configuration options for a Kafka source: http://flume.apache.org/FlumeUserGuide.html#kafka-source

It is pretty straightforward to set up the source configuration. You'll of course need to test it to verify that the settings you've chosen perform well on your system.

To set up HDFS with Avro, you need to configure an HDFS sink, and you're in luck: this site describes how to do so: http://thisdataguy.com/2014/07/28/avro-end-to-end-in-hdfs-part-2-flume-setup/

Lastly, you need to configure a channel. I have experience using Flume's memory channel with default settings (I believe... unable to check right now) and it has worked great.

I recommend you spend time with the Flume documentation: http://flume.apache.org/FlumeUserGuide.html, as all of this information is there. It's important to understand the system you are working with before you set up a Flume agent to process data.
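Putting those three pieces together, a minimal sketch of such an agent could look like the following (the broker list, topic, and HDFS path are placeholders you would replace; the built-in avro_event serializer wraps each event in Flume's own Avro event schema, so if your Kafka payloads are already schemaless Avro binary you'd swap in a schema-aware serializer as shown in the other answer):

```properties
# Name the components of agent a1
a1.sources = kafka-src
a1.channels = mem-ch
a1.sinks = hdfs-snk

# Kafka source
a1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.kafka-src.channels = mem-ch
a1.sources.kafka-src.kafka.bootstrap.servers = broker1:9092,broker2:9092
a1.sources.kafka-src.kafka.topics = my-topic
a1.sources.kafka-src.kafka.consumer.group.id = flume-hdfs

# Memory channel (defaults are a reasonable starting point)
a1.channels.mem-ch.type = memory
a1.channels.mem-ch.capacity = 10000
a1.channels.mem-ch.transactionCapacity = 1000

# HDFS sink writing Avro container files
a1.sinks.hdfs-snk.channel = mem-ch
a1.sinks.hdfs-snk.type = hdfs
a1.sinks.hdfs-snk.hdfs.path = hdfs:///flume/events/%Y-%m-%d
a1.sinks.hdfs-snk.hdfs.useLocalTimeStamp = true
a1.sinks.hdfs-snk.hdfs.fileType = DataStream
a1.sinks.hdfs-snk.hdfs.fileSuffix = .avro
a1.sinks.hdfs-snk.serializer = avro_event
```

You would start it with something like flume-ng agent -n a1 -f agent.conf, then tune batch sizes and roll settings for your load.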

Upvotes: 1
