surajz

Reputation: 3611

Configure Flume to watch a directory for new logs

I am trying to configure Flume to watch the Hadoop task log directories, so that whenever a new job starts, its task log is streamed to Flume, which filters certain log events and sends them on to another destination (while the job is still running).

Is there a Flume source that can be used for this? Something like the exec source running tail (a minimal sketch is below), except that the full file path is not known when the Flume agent is started. I don't think the spooling directory source can be used here, because I need to scan the logs as they are being written.
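
For reference, an exec-source stanza of the kind mentioned above might look roughly like this (the agent, channel, and log file names here are hypothetical); it needs a concrete file path in the tail command when the agent starts, which is exactly the limitation described:

TailAgent.sources = TailSource
TailAgent.channels = MemChannel

# exec source: runs "tail -F" on one known file; it cannot discover new files on its own
TailAgent.sources.TailSource.type = exec
TailAgent.sources.TailSource.command = tail -F /var/log/hadoop/some-known-task.log
TailAgent.sources.TailSource.channels = MemChannel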

Upvotes: 2

Views: 2657

Answers (1)

Erik Schmiegelow

Reputation: 2759

Yes, in fact the spooling directory source will do the job. Here's a sample config:

SpoolAgent.sources = MySpooler
SpoolAgent.channels = MemChannel
SpoolAgent.sinks = HDFS

SpoolAgent.channels.MemChannel.type = memory
SpoolAgent.channels.MemChannel.capacity = 500
SpoolAgent.channels.MemChannel.transactionCapacity = 200

SpoolAgent.sources.MySpooler.channels = MemChannel
SpoolAgent.sources.MySpooler.type = spooldir
SpoolAgent.sources.MySpooler.spoolDir = /var/log/hadoop/
SpoolAgent.sources.MySpooler.fileHeader = true

SpoolAgent.sinks.HDFS.channel = MemChannel
SpoolAgent.sinks.HDFS.type = hdfs
SpoolAgent.sinks.HDFS.hdfs.path = hdfs://cluster/logs/%{file}
SpoolAgent.sinks.HDFS.hdfs.fileType = DataStream
SpoolAgent.sinks.HDFS.hdfs.writeFormat = Text
SpoolAgent.sinks.HDFS.hdfs.batchSize = 100
SpoolAgent.sinks.HDFS.hdfs.rollSize = 0
SpoolAgent.sinks.HDFS.hdfs.rollCount = 0
SpoolAgent.sinks.HDFS.hdfs.rollInterval = 3000

The fileHeader property adds a header containing the name of the file to each event; that header is referenced in the HDFS sink's path as %{file}, which routes the events to the corresponding file in HDFS.
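
To run the agent with this configuration, the launch command would look roughly like the following (the conf directory and properties file name are placeholders; --name must match the agent name used in the file):

flume-ng agent --conf /path/to/conf --conf-file spool-agent.conf --name SpoolAgent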

Upvotes: 4
