Reputation: 3611
I am trying to configure Flume to watch the Hadoop task log directories, so that whenever a new job starts, its task log is streamed to Flume, which filters some event logs and sends them somewhere (while the job is still running).
Is there a Flume source that can be used for this? Something like an exec source running tail, except that the full file path is not known when the Flume agent starts. I don't think the spooling directory source can be used here, because I need to scan the logs as they are written. For reference, what I had in mind with the exec source is sketched below.
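Roughly something like this (the agent, channel, and path names are just placeholders), but the command needs a concrete file path when the agent starts, which is exactly what I don't have:
TailAgent.sources = TailSrc
TailAgent.sources.TailSrc.type = exec
# Placeholder path: the actual job/attempt log directory only exists
# after the job starts, so it can't be hard-coded like this.
TailAgent.sources.TailSrc.command = tail -F /var/log/hadoop/userlogs/some_job_id/syslog
TailAgent.sources.TailSrc.channels = MemChannel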
Upvotes: 2
Views: 2657
Reputation: 2759
Yes, in fact the spooling directory source will do the job. Here's a sample config:
SpoolAgent.sources = MySpooler
SpoolAgent.channels = MemChannel
SpoolAgent.sinks = HDFS
SpoolAgent.channels.MemChannel.type = memory
SpoolAgent.channels.MemChannel.capacity = 500
SpoolAgent.channels.MemChannel.transactionCapacity = 200
SpoolAgent.sources.MySpooler.channels = MemChannel
SpoolAgent.sources.MySpooler.type = spooldir
SpoolAgent.sources.MySpooler.spoolDir = /var/log/hadoop/
SpoolAgent.sources.MySpooler.fileHeader = true
SpoolAgent.sinks.HDFS.channel = MemChannel
SpoolAgent.sinks.HDFS.type = hdfs
SpoolAgent.sinks.HDFS.hdfs.path = hdfs://cluster/logs/%{file}
SpoolAgent.sinks.HDFS.hdfs.fileType = DataStream
SpoolAgent.sinks.HDFS.hdfs.writeFormat = Text
SpoolAgent.sinks.HDFS.hdfs.batchSize = 100
SpoolAgent.sinks.HDFS.hdfs.rollSize = 0
SpoolAgent.sinks.HDFS.hdfs.rollCount = 0
SpoolAgent.sinks.HDFS.hdfs.rollInterval = 3000
The fileHeader property adds a header containing the name of the file, and that header is referenced in the HDFS sink path. This routes the events to the corresponding file in HDFS.
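If you also need the filtering mentioned in the question, a regex filtering interceptor can be attached to the spool source, along these lines (the interceptor name and the regex are just placeholders for whatever events you want to keep or drop):
SpoolAgent.sources.MySpooler.interceptors = LogFilter
SpoolAgent.sources.MySpooler.interceptors.LogFilter.type = regex_filter
# Placeholder pattern: keep only events whose body matches the regex;
# set excludeEvents = true to drop matching events instead.
SpoolAgent.sources.MySpooler.interceptors.LogFilter.regex = ERROR|WARN
SpoolAgent.sources.MySpooler.interceptors.LogFilter.excludeEvents = false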
Upvotes: 4