Reputation: 1105
Are there any rules of thumb for when the data size is sufficient to offset the overhead that Spark processing requires?
I'm working with between 1 and 10 million records. Each record carries 5 Long ids and a small amount of text (less than 5,000 characters).
The workload is report generation, so filter, group, and aggregate. In most cases the top-level aggregation is over all the records, so at some point in the report generation I don't have a good partition key to work with.
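As a rough illustration, the shape of the job is something like the sketch below; the column names and the filter predicate are placeholders, not my real schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("report-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: 5 Long ids plus a short text field per record.
    val records = spark.read.parquet("/data/records")

    // Filter, then group on one of the ids and aggregate.
    val perGroup = records
      .filter($"id1" > 0)                            // placeholder predicate
      .groupBy($"id2")
      .agg(count("*").as("rows"),
           countDistinct($"id3").as("distinct_id3"))

    // Top-level aggregation over all records: there is no partition key
    // here, so this collapses to a single overall reduce.
    val overall = records.agg(count("*").as("total_rows"))

    perGroup.show()
    overall.show()

    spark.stop()
  }
}
```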
I'm aware the question is low on specifics, but does it jump off the page that I'm doing lots of silly things in Spark? Or is Spark's job orchestration likely to add that kind of overhead, meaning I'd be better off only using Spark on larger datasets?
Thanks
Upvotes: 0
Views: 450
Reputation: 1105
The most informative piece of documentation I came across was this line from the Spark tuning guide:
"Spark can efficiently support tasks as short as 200 ms"
https://spark.apache.org/docs/2.1.0/tuning.html
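A minimal sketch of acting on that, assuming the reports go through DataFrame shuffles: lower spark.sql.shuffle.partitions from its default of 200 so that each task on a few million rows has enough work to amortise the per-task overhead. The value 8 below is only an illustration; tune it to your data size and cluster.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: keep shuffle tasks coarse enough that per-task scheduling
// overhead (the ~200 ms figure from the tuning guide) stays negligible.
val spark = SparkSession.builder()
  .appName("small-data-reports")
  // Default is 200; for 1-10 million small records a handful of
  // partitions is usually plenty (8 is an assumed example value).
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()
```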
Upvotes: 0