Reputation: 655
We have a requirement where a calculation must be done in near real time (within 100 ms at most) and involves a moderately complex computation that can be parallelized easily. One of the options we are considering is to use Spark in batch mode on Apache Hadoop YARN. However, I've read that submitting batch jobs to Spark carries a large overhead. Is there a way we can reduce or eliminate this overhead?
Upvotes: 0
Views: 399
Reputation: 2745
Spark makes the best use of the available resources, i.e. memory and cores, and builds its scheduling around the principle of data locality.
If the data and the code that operates on it are together, computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data, because the code is much smaller than the data. If you are low on resources, scheduling and processing time will certainly shoot up.
Spark prefers to schedule all tasks at the best locality level, but this is not always possible. See the data locality section of the tuning guide: https://spark.apache.org/docs/1.2.0/tuning.html#data-locality
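As a rough illustration of the locality tuning described there, here is a minimal Scala sketch. It keeps a single long-lived SparkContext (so you do not pay job-submission cost per request) and lowers `spark.locality.wait`, the time the scheduler waits for a data-local slot before falling back to a less-local one. The application name, the `10ms` value, and the toy computation are all illustrative assumptions, not recommendations from the Spark docs (older 1.x releases also expect this setting as a plain millisecond number rather than a time string).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LowLatencySketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("low-latency-calculation") // hypothetical app name
      // How long the scheduler waits for a data-local executor slot before
      // launching the task at the next locality level (default: 3s).
      // "10ms" is only an illustrative value.
      .set("spark.locality.wait", "10ms")

    // Keep this context alive and reuse it for every incoming request,
    // rather than submitting a fresh batch job each time.
    val sc = new SparkContext(conf)

    // Placeholder for the real parallelizable computation.
    val result = sc.parallelize(1 to 1000, numSlices = 8)
      .map(x => x.toLong * x)
      .reduce(_ + _)

    println(s"result = $result")
    sc.stop()
  }
}
```

Whether a sub-100 ms response is reachable still depends on keeping the data cached close to the executors, since any shuffle or fallback to a remote locality level will dominate that budget.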
Upvotes: 1