Reputation: 167
Everywhere I try to understand Spark, it says it is fast because it keeps data in memory, as opposed to MapReduce. Let's take this example:
I have a 5-node Spark cluster with 100 GB of RAM each, and say I have 500 TB of data to run a Spark job against. The total data Spark can keep in memory is 100 * 5 = 500 GB. If it can hold at most 500 GB of data in memory at any point in time, what makes it lightning fast?
Upvotes: 3
Views: 1141
Reputation: 27443
Spark isn't magical and can't change fundamental principles of computing. Spark uses memory as a progressive enhancement and falls back to disk I/O for huge datasets that cannot be kept in memory. In a scenario where tables must be scanned from disk, Spark's performance should be comparable to other parallel solutions that scan tables from disk.
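As a minimal sketch of that fallback behaviour (the input path and table are assumptions, not from the question), a DataFrame can be persisted with a storage level that spills partitions to disk when they don't fit in RAM, instead of failing:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object SpillExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("memory-with-disk-fallback")
      .getOrCreate()

    // Hypothetical input path; the 500 TB table lives on distributed storage.
    val events = spark.read.parquet("hdfs:///data/events")

    // MEMORY_AND_DISK keeps what fits in RAM and spills the rest to local disk,
    // so the job still runs when the dataset exceeds total cluster memory.
    events.persist(StorageLevel.MEMORY_AND_DISK)

    println(events.count())
    spark.stop()
  }
}
```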
Suppose only 0.1% of the 500 TB is "interesting". For instance, in a marketing funnel there are many ad impressions, fewer clicks, even fewer sales, and fewer still repeat sales. A program can filter through the huge dataset and tell Spark to cache in memory the smaller, filtered and corrected dataset needed for further processing, as sketched below. Caching that smaller filtered dataset is obviously much faster than repeated disk table scans and repeated processing of the larger raw data.
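A rough sketch of that pattern (the path, column names, and filter condition are assumptions for illustration): filter the raw data down to the "interesting" slice once, cache only that slice, and let every later step reuse the in-memory copy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object FunnelExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-then-cache")
      .getOrCreate()

    // Hypothetical 500 TB of raw funnel events on distributed storage.
    val impressions = spark.read.parquet("hdfs:///data/impressions")

    // Keep only the ~0.1% of rows that led to a sale; this is the one
    // full scan of the raw data.
    val sales = impressions.filter(col("event_type") === "sale")

    // Cache the small filtered dataset in memory so later stages reuse it
    // instead of rescanning 500 TB from disk.
    sales.cache()

    // Subsequent actions hit the in-memory cache, not the raw table.
    println(sales.count())
    sales.groupBy("campaign_id").count().show()

    spark.stop()
  }
}
```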
Upvotes: 10