oikonomiyaki

Reputation: 7951

Difference between Flume HDFS Sink Flush and Roll

I came across two configuration properties of HDFS Sink in Flume documentation:

hdfs.rollCount  Number of events written to the file before it is rolled (0 = never roll based on number of events)

and

hdfs.batchSize  Number of events written to the file before it is flushed to HDFS

I want to know the difference between these two properties, and the difference between roll and flush as well. They look the same to me.

Upvotes: 0

Views: 1961

Answers (2)

Gongqin Shen

Reputation: 751

Roll means that the sink will close the current file by removing the hdfs.inUseSuffix (".tmp" by default) from the file name, and will write incoming events to a new file until that file again reaches the limit, and the cycle continues.

Flush means writing the N events cached in the memory buffer to HDFS at once, where N is defined by hdfs.batchSize. For example, if hdfs.batchSize is set to 100, then instead of 100 separate I/O operations there is one larger I/O operation that writes out all 100 events at once, reducing the overhead of opening and closing streams.

hdfs.rollCount defines the maximum number of events in each file, and hdfs.batchSize defines the maximum number of events in the memory buffer. In certain scenarios, rolling and flushing happen before these thresholds are reached; for instance, when the Flume agent is shutting down, the current file is closed without necessarily containing hdfs.rollCount events, and all remaining events in the memory buffer are flushed out to HDFS.
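As a minimal sketch, the two thresholds sit side by side in the sink configuration (the agent and sink names agent1 and hdfsSink are made up for illustration; the hdfs.* property names are from the Flume HDFS sink documentation):

    # Hypothetical agent/sink names for illustration
    agent1.sinks = hdfsSink
    agent1.sinks.hdfsSink.type = hdfs
    agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
    # Roll: finish the current file (drop the in-use suffix) after 1000 events
    agent1.sinks.hdfsSink.hdfs.rollCount = 1000
    # Flush: write buffered events to the open file 100 at a time
    agent1.sinks.hdfsSink.hdfs.batchSize = 100
    # Suffix carried by a file while it is still open (default .tmp)
    agent1.sinks.hdfsSink.hdfs.inUseSuffix = .tmp

With these settings, each file would be closed after 1000 events, and those events would have reached it in ten flushes of 100.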

Upvotes: 3

Jiayu Ji

Reputation: 46

In the HDFS sink, roll means closing the current file and writing incoming events to a new file. There are three different roll triggers in this sink: hdfs.rollCount, hdfs.rollInterval, and hdfs.rollSize.

Batch size determines how often the sink commits from the channel. This helps significantly when you are using a file channel: since each commit removes the committed events from the channel, fewer commit calls mean less random disk I/O and better throughput.
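A sketch of how the three roll triggers and the batch size might be combined with a file channel (component names are again hypothetical; setting a roll trigger to 0 disables it):

    # Hypothetical names; a file channel feeds the HDFS sink
    agent1.channels = fileCh
    agent1.channels.fileCh.type = file
    agent1.sinks.hdfsSink.channel = fileCh
    # Any trigger can roll the file; 0 disables a trigger
    # rollInterval is in seconds, rollSize in bytes
    agent1.sinks.hdfsSink.hdfs.rollInterval = 300
    agent1.sinks.hdfsSink.hdfs.rollSize = 134217728
    agent1.sinks.hdfsSink.hdfs.rollCount = 0
    # Larger batches mean fewer channel commits and less random disk I/O
    agent1.sinks.hdfsSink.hdfs.batchSize = 1000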

Upvotes: 3
