dinesh028
dinesh028

Reputation: 2187

Too many small files HDFS Sink Flume

agent.sinks=hpd
agent.sinks.hpd.type=hdfs
agent.sinks.hpd.channel=memoryChannel
agent.sinks.hpd.hdfs.path=hdfs://master:9000/user/hduser/gde
agent.sinks.hpd.hdfs.fileType=DataStream
agent.sinks.hpd.hdfs.writeFormat=Text
agent.sinks.hpd.hdfs.rollSize=0
agent.sinks.hpd.hdfs.batchSize=1000
agent.sinks.hpd.hdfs.fileSuffix=.i  
agent.sinks.hpd.hdfs.rollCount=1000
agent.sinks.hpd.hdfs.rollInterval=0

I'm trying to use HDFS Sink to write events to HDFS. And have tried Size, Count and Time bases rolling but none is working as expected. It is generating too many small files in HDFS like:

-rw-r--r--   2 hduser supergroup      11617 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832879.i
-rw-r--r--   2 hduser supergroup       1381 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832880.i
-rw-r--r--   2 hduser supergroup        553 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832881.i
-rw-r--r--   2 hduser supergroup       2212 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832882.i
-rw-r--r--   2 hduser supergroup       1379 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832883.i
-rw-r--r--   2 hduser supergroup       2762 2016-03-05 19:37 hdfs://master:9000/user/hduser/gde/FlumeData.1457186832884.i.tmp

Please assist to resolve the given problem. I'm using flume 1.6.0

~Thanks

Upvotes: 0

Views: 963

Answers (2)

dinesh028
dinesh028

Reputation: 2187

My provided configurations were all correct. The reason behind such behavior was HDFS. I had 2 data nodes out of which one was down. So, files were not achieving minimum required replication. In Flume logs one can see below warning message too:

"Block Under-replication detected. Rotating file."

To remove this problem one can opt for any of below solution:-

  • Up the data node to achieve required replication of blocks, or
  • Set property hdfs.minBlockReplicas accordingly.

~Thanks

Upvotes: 1

Sudev Ambadi
Sudev Ambadi

Reputation: 655

You are now rolling the files for every 1000 items. You can try either of two methods mentioned below.

  1. Try increasing hdfs.rollCount to much higher value, this value decides number of events contained in each rolled file.
  2. Remove hdfs.rollCount and set hdfs.rollInterval to interval at which you want to roll your file. Say hdfs.rollInterval = 600 to roll file every 10 minutes.

For more information refer Flume Documentation

Upvotes: 0

Related Questions