Shu

Reputation: 153

How does Apache Flink compare to Mapreduce on Hadoop?

How does Apache Flink compare to Mapreduce on Hadoop? In what ways it's better and why?

Upvotes: 14

Views: 4633

Answers (2)

Stephan Ewen

Reputation: 2371

Adding to Fabian's answer:

One more difference is that Flink is not a pure batch-processing system, but can at the same time do low-latency streaming analysis and offers a nice API to define streaming analysis programs.

Internally, Flink is actually a streaming system. To Flink, batch programs are a special case of streaming programs.
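To make the "batch as a special case of streaming" idea concrete, here is a minimal sketch in plain Python (not the Flink API): a processing loop that treats any iterable source the same way, so a finite batch input is simply a stream that happens to end.

```python
# Conceptual sketch (not the Flink API): batch processing as a special
# case of stream processing, where the "stream" is simply bounded.

def process_stream(records, on_record):
    """Apply a per-record function as records arrive; works identically
    for an unbounded source or a finite (batch) one."""
    results = []
    for record in records:  # could be a socket, a queue, or a plain list
        results.append(on_record(record))
    return results

# A batch input is just a stream that happens to end:
batch = [1, 2, 3, 4]
print(process_stream(batch, lambda x: x * 2))  # [2, 4, 6, 8]
```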

Upvotes: 7

Fabian Hueske

Reputation: 18987

Disclaimer: I'm a committer and PMC member of Apache Flink.

Similar to Hadoop MapReduce, Apache Flink is a parallel data processor with its own API and execution engine. Flink aims to support many of the use cases that Hadoop is being used for and plays nicely with many systems from the Hadoop ecosystem, including HDFS and YARN.

I will answer your question by distinguishing between the MapReduce programming model and the MapReduce execution model.

Programming Model

Apache Flink's programming model is based on concepts of the MapReduce programming model but generalizes it in several ways. Flink offers Map and Reduce functions but also additional transformations such as Join, CoGroup, Filter, and Iterations. These transformations can be assembled into arbitrary data flows, including multiple sources, sinks, and branching and merging flows. Flink's data model is more generic than MapReduce's key-value pair model and allows the use of arbitrary Java (or Scala) data types. Keys can be defined on these data types in a flexible manner.
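The difference between the two models can be illustrated with a plain-Python sketch (not Flink's actual API, and the record fields are hypothetical): MapReduce expresses everything as Map and Reduce over key-value pairs, while a Flink-style program can also use transformations such as Filter and Join on arbitrary record types.

```python
# Conceptual sketch in plain Python (not the Flink API).
from collections import defaultdict

# MapReduce-style word count: map each word to (word, 1), reduce by key.
def word_count(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in line.split():  # "map" phase emits (word, 1)
            counts[word] += 1      # "reduce" phase sums per key
    return dict(counts)

# A Flink-style dataflow additionally offers transformations like
# Filter and Join (record layout here is purely illustrative):
def filter_and_join(pages, visits):
    recent = [v for v in visits if v["year"] >= 2014]  # Filter
    by_url = {p["url"]: p for p in pages}
    return [(by_url[v["url"]], v)                      # Join on "url"
            for v in recent if v["url"] in by_url]
```

In real Flink programs these transformations are chained into a single dataflow graph rather than expressed as separate MapReduce jobs, which is where much of the conciseness comes from.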

Consequently, Flink's programming model is a superset of the MapReduce programming model. It allows many programs to be defined in a much more convenient and concise way. I also want to point out that it is possible to embed unmodified Hadoop functions (Input/OutputFormats, Mappers, Reducers) in Flink programs and execute them jointly with native Flink functions.

Execution Model

Looking at the execution model, Flink borrows many concepts from parallel relational database systems. Flink features a pipelined processing model which reduces the need to materialize intermediate results on local or distributed filesystems (in addition, this also allows Flink to do real-time stream processing). Moreover, the execution of a Flink program is not tightly coupled to the program's specification. In MapReduce (as done by Apache Hadoop), the execution of each MapReduce program follows exactly the same pattern. Flink programs are given to an optimizer which figures out an efficient execution plan. Similar to a relational DBMS, the optimizer chooses data shipping and join strategies in such a way that expensive operations such as data shuffling and sorting are avoided. I should point out that Flink has not been tested at the massive scale-out that Hadoop is running on. I know of Flink setups that run on up to 200 nodes.
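The pipelining point can be sketched in plain Python (a conceptual analogy, not how either engine is implemented): a MapReduce-style engine writes the complete intermediate result between stages, while a pipelined engine forwards each record downstream as soon as it is produced.

```python
# Conceptual sketch: materialized vs. pipelined execution of two stages.

def materialized(data):
    # Stage 1 produces its complete output before stage 2 starts,
    # analogous to Hadoop MapReduce writing intermediates to disk.
    stage1 = [x * 2 for x in data]    # fully materialized
    stage2 = [x + 1 for x in stage1]  # fully materialized
    return stage2

def pipelined(data):
    # Generators forward each record downstream immediately; no
    # intermediate collection is ever built in full.
    stage1 = (x * 2 for x in data)
    stage2 = (x + 1 for x in stage1)
    return list(stage2)

print(pipelined([1, 2, 3]))  # [3, 5, 7] — same result, no intermediate copy
```

Both functions compute the same result; the difference is that the pipelined version never holds the intermediate dataset in memory (or on disk) at once, which is what makes low-latency streaming possible on the same engine.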

Upvotes: 17
