Use of Streaming in Hadoop

I am trying to study this Flink CEP example. I do see that in this example,they have created an single application (which is kind of streaming application) which is producing & consuming the data & applying pattern matching on the data. They have not put an streaming layer in between (like Kafka).Till now,single application is enough to fit for this purpose which makes it very optimized. Now,I know that if I use Kafka,then I will require 2 applications; one for ingesting data into Kafka topic & other for consuming data from Kafka topic..I have few questions which I am not getting answered ::

Why they have not use any streaming layer (like Kafka) in this example ??
When & where streaming is required??
Referring to the Flink CEP example,I want to know where & how streaming layer (like Kafka/Kinesis) will come into play ??
What will be the advantages/disadvantages if streaming layer like Kafka/Kinesis) comes in between ??

Upvotes: 0

Answers (1)

Ivan Mushketyk

Reputation: 8305

Let me answer your questions one-by-one.

Why they have not use any streaming layer (like Kafka) in this example?

I think you have a misconception about streaming in Flink. First of all Flink is streams processing engine. Basically everything that Flink is processing is a stream.

You many know that Flink can work either in stream or batch modes, but for Flink batch is just a special case of a stream that has a finite length, while streams are usually infinite. So everything is a stream of events in Flink. So the question is where Flink gets data from.

Flink can read data from a number of sources and Kafka is one of the sources that can be used in Flink. Take a look at this and this folder in Flink repository. They contain implements different sources in Flink including Kafka, Kinesis, RabbitMQ and so on. From the Flink perspective it does not matter if data is coming from an external system, is read from a file or if it is being generated.

A Flink user can implement his/her source of data that will be used by Flink runtime. To do this one need to extend RichSourceFunction class and implement the run method. For example this data source will generate an infinite stream of numbers starting from 0:

public class DummySource extends  RichParallelSourceFunction<Integer> {
  public void run(SourceContext<Integer> sourceContext) throws Exception {
    // You can specify custom termination conditions
    // the source should not be inifite
    int i = 0;
    while (true) {
      // provide an event for Flink processing
      sourceContext.collect(i);
      i++;
    }
  }
}

Since it does not matter what data source to use the author of the tutorial decided to simplify the example and use a simple data source that generates data using a random number generator:

MonitoringEvent monitoringEvent;

int rackId = random.nextInt(shard) + offset;
if (random.nextDouble() >= temperatureRatio) {
  double power = random.nextGaussian() * powerStd + powerMean;
  monitoringEvent = new PowerEvent(rackId, power);
} else {
  double temperature = random.nextGaussian() * temperatureStd + temperatureMean;
  monitoringEvent = new TemperatureEvent(rackId, temperature);
}

sourceContext.collect(monitoringEvent);

While in reality you would read events data from an external system like Kafka or Kinesis the example is intentionally simplistic to show the gist of the CEP library.

When & where streaming is required?

If by "streaming" you mean not-batch, then it is safe to say that it should be used when events are constantly being received and you need close-to-real time processing time.

If you are asking when you should use Kafka then you can use it for processing stream of events, use it as a message broker, use it for log aggregation and so on. Here is a list of use cases for which you can use Apache Kafka.

Referring to the Flink CEP example,I want to know where & how streaming layer (like Kafka/Kinesis) will come into play ?

What will be the advantages/disadvantages if streaming layer like Kafka/Kinesis) comes in between ?

In real world application you would use Kafka/Kinesis data source or a different data source that reads data from an external system.

Kafka is an alternative to existing message brokers like RabbitMQ and has excellent performance characteristics, but you can use other data sources in Flink, or even write your own.

Upvotes: 1

Use of Streaming in Hadoop

Answers (1)

Related Questions