oikonomiyaki

Reputation: 7951

Difference between Flume HDFS Sink Flush and Roll

I came across two configuration properties of HDFS Sink in Flume documentation:

hdfs.rollCount  Number of events written to the file before it is rolled (0 = never roll based on number of events)

and

hdfs.batchSize  Number of events written to the file before it is flushed to HDFS

I want to know the difference between these two properties, and the difference between roll and flush as well. They look the same to me.

Upvotes: 0

Views: 1961

Answers (2)

Gongqin Shen

Reputation: 751

Roll means that the sink will close the current file by removing the hdfs.inUseSuffix (".tmp" by default) from the file name, and will write incoming events to a new file until that file again reaches the limit, and the cycle continues.

Flush means writing the N events cached in the memory buffer to HDFS at once, where N is defined by hdfs.batchSize. For example, if hdfs.batchSize is set to 100, then instead of 100 separate I/O operations there is one larger I/O operation that writes out all 100 events at once, reducing the overhead of opening and closing streams.

hdfs.rollCount defines the maximum number of events in each file, and hdfs.batchSize defines the maximum number of events in the memory buffer. In certain scenarios, rolling and flushing happen before these thresholds are reached; for instance, when the Flume agent is shutting down, the current file is closed without necessarily containing hdfs.rollCount events, and all remaining events in the memory buffer are flushed out to HDFS.
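As a minimal sketch, the two thresholds sit side by side in the sink configuration (the agent and sink names agent1 and hdfsSink are made up for illustration; the hdfs.* property names are from the Flume HDFS sink documentation):

    # Hypothetical agent/sink names for illustration
    agent1.sinks = hdfsSink
    agent1.sinks.hdfsSink.type = hdfs
    agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
    # Roll: finish the current file (drop the in-use suffix) after 1000 events
    agent1.sinks.hdfsSink.hdfs.rollCount = 1000
    # Flush: write buffered events to the open file 100 at a time
    agent1.sinks.hdfsSink.hdfs.batchSize = 100
    # Suffix carried by a file while it is still open (default .tmp)
    agent1.sinks.hdfsSink.hdfs.inUseSuffix = .tmp

With these settings, each file would be closed after 1000 events, and those events would have reached it in ten flushes of 100.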

Upvotes: 3

Jiayu Ji

Reputation: 46

In the HDFS sink, roll means closing the current file and writing incoming events to a new file. There are three different roll triggers in this sink: hdfs.rollCount, hdfs.rollInterval, and hdfs.rollSize.

Batch size determines how often the sink commits from the channel. This helps significantly when you are using a file channel: since each commit removes the committed events from the channel, fewer commit calls mean less random disk I/O and better throughput.
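A sketch of how the three roll triggers and the batch size might be combined with a file channel (component names are again hypothetical; setting a roll trigger to 0 disables it):

    # Hypothetical names; a file channel feeds the HDFS sink
    agent1.channels = fileCh
    agent1.channels.fileCh.type = file
    agent1.sinks.hdfsSink.channel = fileCh
    # Any trigger can roll the file; 0 disables a trigger
    # rollInterval is in seconds, rollSize in bytes
    agent1.sinks.hdfsSink.hdfs.rollInterval = 300
    agent1.sinks.hdfsSink.hdfs.rollSize = 134217728
    agent1.sinks.hdfsSink.hdfs.rollCount = 0
    # Larger batches mean fewer channel commits and less random disk I/O
    agent1.sinks.hdfsSink.hdfs.batchSize = 1000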

Upvotes: 3
