Why there is no in memory computation functionality in latest Hadoop?

Question

We all know that Spark uses RAM to store processed data, both Spark and Hadoop use RAM for computation, which makes Spark to access data at blazing fast speed. But if that is the one thing which makes a lot of difference (apart from Tungsten and Catalyst), we could have added it in Hadoop itself. Why we have not changed just the storing routine in Hadoop (making it in memory) instead of inventing a different tool (Apache Spark) altogether? Are there any other limitations that prevent Hadoop from implementing in memory storage?

Coursal · Accepted Answer

There are two main factors that determine the "choice" of using another platform altogether for faster computations (e.g. Spark) on top of Hadoop instead of reforming the way that the latter executes its applications.

1. Hadoop is more of an infrastructure than just a distributed computing library

And by no means do I imply that you can't use it as such to develop an application based on your needs by using the ways of the MapReduce paradigm. When we are talking about working in Hadoop, we are not just talking about a resource manager (YARN) or a distributed file system (HDFS) but we also have to include the ecosystem of products that are based or applicable to it (like Flume, Pig, Hive, and yes you guessed right Spark as well). Those modules act as extensions on top of Hadoop in order to make things easier and more flexible whenever the Hadoop MapReduce way of handling tasks and/or storing data on the disk gets troublesome.

There's a big chance you actually used Spark to run an application using its beautiful and thorough libraries while retrieving your data from a directory in HDFS and you can get that Hadoop is just the base of the platform on which your application is running. Whatever you can put on top of it is your choice and preference exclusively based on your needs.

2. Main Memory is much more expensive and complicated

There's a big relief you can have when you are developing an application in Hadoop while knowing that all of the processed data will always be stored in the disk of your system/cluster since you know that:

a) you will easily be able to point out what's sticking out like a sore thumb by looking at the inbetween and final process data by yourself and

b) you can easily support applications that will probably need from 500GB to 10-20TB (if we are talking about a cluster, I guess) but it's entirely different conversation if you can support heavy (and I mean heavy, like multiple GB of RAM) application memory-wise

This has to do with the whole scale-out way of scaling resources in projects like Hadoop, where instead of building a few powerful nodes that can take huge chunks of data to process, it is preferred to just add more less-powerful nodes that are built with common hardware specifications in mind. This is also one of the reasons that Hadoop is in someways still mistaken for a project that is centered around building small in-house data warehouses (but this is really a story for another time).

However I'm kind of obliged to say at this point that Hadoop is slowly being dropped in usage due to the latest trends since:

projects like Spark become more independent and approachable/user-friendly in the use of more complex stuff like machine learning applications (you can read this small and neat article about it where some reality checks are being given over here)
the infrastructure aspect of Hadoop is challenged by the use of Kubernetes containers instead of its YARN module, or Amazon's S3 which can actually replace HDFS altogether (but that doesn't mean that things are that bad about Hadoop just yet, you can take a taste of experimentation and the current state of things in this more broad and opinion-based article here)

In the end I believe that Hadoop will find its use for years onward, but everyone is also moving onward as well. The concepts of Hadoop are valuable to know and grasp, even if there might not even be any companies or enterprises that implement it because you will never really know if is going to be easier and more stable or not to develop something using Hadoop instead of a newer and slicker thing that everybody uses.

Why there is no in memory computation functionality in latest Hadoop?

Answers (2)

Related Questions