itaied

Reputation: 7107

Flume to HDFS splits a file into lots of small files

I'm trying to transfer a 700 MB log file to HDFS with Flume. I have configured the Flume agent as follows:

...
tier1.channels.memory-channel.type = memory
...
tier1.sinks.hdfs-sink.channel = memory-channel
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.path = hdfs://***
tier1.sinks.hdfs-sink.fileType = DataStream
tier1.sinks.hdfs-sink.rollSize = 0

The source is a spooldir, the channel is memory, and the sink is HDFS.

I have also tried sending a 1 MB file, and Flume split it into 1000 files of 1 KB each. Another thing I noticed is that the transfer was very slow: the 1 MB took about a minute. Am I doing something wrong?

Upvotes: 1

Views: 1435

Answers (1)

Erik Schmiegelow

Reputation: 2759

You need to disable the roll timeout as well; that is done with the following settings:

tier1.sinks.hdfs-sink.hdfs.rollCount = 0
tier1.sinks.hdfs-sink.hdfs.rollInterval = 300

Setting rollCount to 0 prevents count-based rollovers; rollInterval here is set to 300 seconds, and setting it to 0 would disable time-based rollovers too. You will have to choose which mechanism you want to trigger rollovers, otherwise Flume will only close the files upon shutdown. A complete example is sketched after the defaults below.

The default values are the following:

Property            Default   Description
hdfs.rollInterval   30        Seconds to wait before rolling the current file (0 = never roll based on time interval)
hdfs.rollSize       1024      File size in bytes that triggers a roll (0 = never roll based on file size)
hdfs.rollCount      10        Number of events written to the file before it is rolled (0 = never roll based on number of events)
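Putting these settings together with the question's configuration, a corrected sink block might look like the sketch below. Note that the Flume HDFS sink expects its own properties to carry the hdfs. prefix (as in the two roll lines above); if your actual config omits that prefix on rollSize, the line is ignored and the 1024-byte default from the table applies, which would match the 1 KB files you are seeing.

# A sketch of the full sink block, assuming Flume NG's HDFS sink property
# names; hdfs://*** stands in for the real destination path.
tier1.sinks.hdfs-sink.channel = memory-channel
tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.hdfs.path = hdfs://***
tier1.sinks.hdfs-sink.hdfs.fileType = DataStream
# Roll only on the 5-minute timer; size- and count-based rolling disabled.
tier1.sinks.hdfs-sink.hdfs.rollSize = 0
tier1.sinks.hdfs-sink.hdfs.rollCount = 0
tier1.sinks.hdfs-sink.hdfs.rollInterval = 300

With this setup each output file collects five minutes of events; raise rollInterval, or switch to a size-based trigger via rollSize instead, if you prefer fewer, larger files.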

Upvotes: 3
