custom Flume interceptor: intercept() method called multiple times for the same Event

Question

TL;DR

When a Flume source fails to push a transaction to the next channel in the pipeline, does it always keep event instances for the next try?

In general, is it safe to have a stateful Flume interceptor, where processing of events depends on previously processed events?

Full problem description:

I am considering the possibility of leveraging guarantees offered by Apache Kafka regarding the way topic partitions are distributed among consumers in a consumer group to perform streaming deduplication in an existing Flume-based log consolidation architecture.

Using the Kafka Source for Flume and custom routing to Kafka topic partitions, I can ensure that every event that should go to the same logical "deduplication queue" will be processed by a single Flume agent in the cluster (for as long as there are no agent stops/starts within the cluster). I have the following setup using a custom-made Flume interceptor:

[KafkaSource with deduplication interceptor]-->()MemoryChannel)-->[HDFSSink]

It seems that when the Flume Kafka source runner is unable to push a batch of events to the memory channel, the event instances that are part of the batch are passed again to my interceptor's intercept() method. In this case, it was easy to add a tag (in the form of a Flume event header) to processed events to distinguish actual duplicates from events in a failed batch that got re-processed.

However, I would like to know if there is any explicit guarantee that Event instances in failed transactions are kept for the next try or if there is the possibility that events are read again from the actual source (in this case, Kafka) and re-built from zero. In that case, my interceptor will consider those events to be duplicates and discard them, even though they were never delivered to the channel.

EDIT

This is how my interceptor distinguishes an Event instance that was already processed from a non-processed event:

public Event intercept(Event event) {
  Map headers = event.getHeaders();
  // tagHeaderName is the name of the header used to tag events, never null
  if( !tagHeaderName.isEmpty() ) {
    // Don't look further if event was already processed...
    if( headers.get(tagHeaderName)!=null )
      return event;
    // Mark it as processed otherwise...
    else
      headers.put(tagHeaderName, "");
  }
  // Continue processing of event...
}

custom Flume interceptor: intercept() method called multiple times for the same Event

Answers (1)

Related Questions