Reputation: 51
I have read about streaming data in many places, but I am trying to understand the challenges faced when processing it with the MapReduce technique,
i.e. the reason behind the existence of frameworks like Apache Flume, Apache Storm, etc.
Please share your advice and thoughts.
Thanks, Ranit
Upvotes: 0
Views: 509
Reputation: 433
Your question is open-ended, but I assume you want to understand the challenges of processing streaming data in a MapReduce environment.
1) MapReduce is primarily designed for batch processing. It is meant for processing high volumes of data at rest on disk.
2) Streaming data arrives at high velocity from various sources, such as web application clickstreams, social media logs, Twitter tags, and application logs.
3) The stream of events may be processed either in a stateless manner (each event is handled independently of the others) or in a stateful manner (for example, collecting the data for 2 seconds and then processing it together). Batch applications have no such requirement.
4) Streaming applications need delivery/processing guarantees. For example, a framework may need to provide an "exactly once" delivery/processing mechanism, so that every stream event is processed without loss or duplication. This is not a challenge in batch processing, since all the data is already available locally.
5) External connectors: streaming frameworks must support external connectivity to read data in real time from the various sources mentioned in (2). This is not a challenge in batch processing, since the data is locally available.
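To make the stateful case in (3) concrete, here is a minimal sketch in plain Python of grouping a stream into fixed 2-second windows. It is not how Storm or Spark implement windowing; the function name `tumbling_windows`, the `(timestamp, payload)` event shape, and the assumption that events arrive in timestamp order are all made up for illustration.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (timestamp in seconds, payload)
Event = Tuple[float, str]


def tumbling_windows(
    events: Iterable[Event], size: float = 2.0
) -> Iterator[Tuple[float, List[str]]]:
    """Group a time-ordered event stream into fixed, non-overlapping windows.

    Assumes events arrive in timestamp order; real frameworks also have
    to deal with late and out-of-order events.
    """
    window_start = None
    buffer: List[str] = []
    for ts, payload in events:
        # Start of the window this event falls into
        start = (ts // size) * size
        if window_start is None:
            window_start = start
        if start != window_start:
            # The previous window is complete: emit it and start a new one
            yield window_start, buffer
            window_start, buffer = start, []
        buffer.append(payload)
    if buffer:
        # Flush the last, possibly partial, window when the input ends
        yield window_start, buffer


if __name__ == "__main__":
    clicks = [(0.1, "a"), (1.5, "b"), (2.2, "c"), (4.0, "d"), (4.9, "e")]
    for start, payloads in tumbling_windows(clicks):
        print(start, payloads)
```

The point of the sketch is that the processor has to keep state (the buffer and the current window) between events, and the input never "ends" in a real stream. A batch MapReduce job has neither concern, because it sees the whole dataset before it starts.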
Hope this helps.
Upvotes: 0
Reputation: 20798
There are many technologies out there, and many of them run on the Hadoop framework.
The older Hadoop services like Hive tend to be slow and are usually used for batch jobs, not for streaming. As streaming becomes more and more of a necessity, other services such as Storm or Spark have surfaced. They are designed for faster execution and for integration with messaging queues like Kafka.
In data analytics, though, most of the time the processing is not all real time: historical data may be processed in batch mode to extract models that are then used for real-time analytics, so a 'streaming' system is usually based on a Lambda Architecture http://lambda-architecture.net/
A service like Spark tries to integrate all of the components, with Spark Streaming for the speed layer, Spark SQL for the serving layer, and Spark MLlib for the modeling, all based on the Hadoop Distributed File System (HDFS) for replicated large-volume storage.
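As a rough illustration of the batch/speed split in a Lambda Architecture, here is a toy sketch in plain Python that counts page views. It does not use Spark; `build_batch_view`, `SpeedLayer`, and `query` are hypothetical names, and in-memory counters stand in for real batch and real-time views.

```python
from collections import Counter
from typing import Iterable


def build_batch_view(historical_events: Iterable[str]) -> Counter:
    """Batch layer: recompute counts from all historical data at rest."""
    return Counter(historical_events)


class SpeedLayer:
    """Speed layer: update counts incrementally as each new event arrives."""

    def __init__(self) -> None:
        self.view: Counter = Counter()

    def on_event(self, key: str) -> None:
        self.view[key] += 1


def query(key: str, batch_view: Counter, speed_layer: SpeedLayer) -> int:
    """Serving side: merge the batch view with the real-time view."""
    return batch_view[key] + speed_layer.view[key]


if __name__ == "__main__":
    batch_view = build_batch_view(["home", "cart", "home"])
    speed = SpeedLayer()
    speed.on_event("home")  # an event that arrived after the last batch run
    print(query("home", batch_view, speed))
```

In a real system the batch view would be rebuilt periodically from the data in HDFS, and the speed layer would only cover events that arrived since the last batch run.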
Flume helps move the data from its source into HDFS for raw storage, but Storm or Spark is used to actually process it.
Hope that helps.
Upvotes: 1