Shu

Reputation: 153

How does Apache Flink compare to Mapreduce on Hadoop?

How does Apache Flink compare to Mapreduce on Hadoop? In what ways it's better and why?

Upvotes: 14

Views: 4633

Answers (2)

Stephan Ewen

Reputation: 2371

Adding to Fabian's answer:

One more difference is that Flink is not a pure batch-processing system, but can at the same time do low-latency streaming analysis and offers a nice API to define streaming analysis programs.

Internally, Flink is actually a streaming system. To Flink, batch programs are a special case of streaming programs.
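To make the "batch as a special case of streaming" idea concrete, here is a minimal sketch in plain Python (not the Flink API): a processing loop that treats any iterable source the same way, so a finite batch input is simply a stream that happens to end.

```python
# Conceptual sketch (not the Flink API): batch processing as a special
# case of stream processing, where the "stream" is simply bounded.

def process_stream(records, on_record):
    """Apply a per-record function as records arrive; works identically
    for an unbounded source or a finite (batch) one."""
    results = []
    for record in records:  # could be a socket, a queue, or a plain list
        results.append(on_record(record))
    return results

# A batch input is just a stream that happens to end:
batch = [1, 2, 3, 4]
print(process_stream(batch, lambda x: x * 2))  # [2, 4, 6, 8]
```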

Upvotes: 7

Fabian Hueske

Reputation: 18987

Disclaimer: I'm a committer and PMC member of Apache Flink.

Similar to Hadoop MapReduce, Apache Flink is a parallel data processor with its own API and execution engine. Flink aims to support many of the use cases that Hadoop is being used for and plays nicely with many systems from the Hadoop ecosystem, including HDFS and YARN.

I will answer your question by distinguishing between the MapReduce programming model and the MapReduce execution model.

Programming Model

Apache Flink's programming model is based on concepts of the MapReduce programming model but generalizes it in several ways. Flink offers Map and Reduce functions but also additional transformations such as Join, CoGroup, Filter, and Iterations. These transformations can be assembled into arbitrary data flows, including multiple sources, sinks, and branching and merging flows. Flink's data model is more generic than MapReduce's key-value pair model and allows the use of arbitrary Java (or Scala) data types. Keys can be defined on these data types in a flexible manner.
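The difference between the two models can be illustrated with a plain-Python sketch (not Flink's actual API, and the record fields are hypothetical): MapReduce expresses everything as Map and Reduce over key-value pairs, while a Flink-style program can also use transformations such as Filter and Join on arbitrary record types.

```python
# Conceptual sketch in plain Python (not the Flink API).
from collections import defaultdict

# MapReduce-style word count: map each word to (word, 1), reduce by key.
def word_count(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():  # "map" phase emits (word, 1)
            counts[word] += 1      # "reduce" phase sums per key
    return dict(counts)

# A Flink-style dataflow additionally offers transformations like
# Filter and Join (record layout here is purely illustrative):
def filter_and_join(pages, visits):
    recent = [v for v in visits if v["year"] >= 2014]  # Filter
    by_url = {p["url"]: p for p in pages}
    return [(by_url[v["url"]], v)                      # Join on "url"
            for v in recent if v["url"] in by_url]
```

In real Flink programs these transformations are chained into a single dataflow graph rather than expressed as separate MapReduce jobs, which is where much of the conciseness comes from.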

Consequently, Flink's programming model is a superset of the MapReduce programming model. It allows many programs to be defined in a much more convenient and concise way. I also want to point out that it is possible to embed unmodified Hadoop functions (Input/OutputFormats, Mappers, Reducers) in Flink programs and execute them jointly with native Flink functions.

Execution Model

Looking at the execution model, Flink borrows many concepts from parallel relational database systems. Flink features a pipelined processing model which reduces the need to materialize intermediate results on local or distributed filesystems (in addition, this also allows Flink to do real-time stream processing). Moreover, the execution of a Flink program is not tightly coupled to the program's specification. In MapReduce (as done by Apache Hadoop), the execution of each MapReduce program follows exactly the same pattern. Flink programs are given to an optimizer which figures out an efficient execution plan. Similar to a relational DBMS, the optimizer chooses data shipping and join strategies in such a way that expensive operations such as data shuffling and sorting are avoided. I should point out that Flink has not been tested at the massive scale-out that Hadoop is running on. I know of Flink setups that run on up to 200 nodes.
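The pipelining point can be sketched in plain Python (a conceptual analogy, not how either engine is implemented): a MapReduce-style engine writes the complete intermediate result between stages, while a pipelined engine forwards each record downstream as soon as it is produced.

```python
# Conceptual sketch: materialized vs. pipelined execution of two stages.

def materialized(data):
    # Stage 1 produces its complete output before stage 2 starts,
    # analogous to Hadoop MapReduce writing intermediates to disk.
    stage1 = [x * 2 for x in data]    # fully materialized
    stage2 = [x + 1 for x in stage1]  # fully materialized
    return stage2

def pipelined(data):
    # Generators forward each record downstream immediately; no
    # intermediate collection is ever built in full.
    stage1 = (x * 2 for x in data)
    stage2 = (x + 1 for x in stage1)
    return list(stage2)

print(pipelined([1, 2, 3]))  # [3, 5, 7] — same result, no intermediate copy
```

Both functions compute the same result; the difference is that the pipelined version never holds the intermediate dataset in memory (or on disk) at once, which is what makes low-latency streaming possible on the same engine.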

Upvotes: 17
