brent

Reputation: 1105

Spark Job estimated overhead on smaller datasets

Are there any rules of thumb for when the data size is sufficient to offset the overhead that Spark processing requires?

I'm working with between 1 and 10 million records. Each record carries 5 Long ids and a small amount of text (less than 5,000 characters).

The workload is to create reports, so filter, group and aggregate - roughly the shape sketched below. In most cases the top-level aggregation will be over all the records, so at some point in the report generation I don't have a good partition key to work with.
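For concreteness, here is a minimal sketch of the kind of job I mean, assuming the DataFrame API. The column names (id1..id5, text), the input path and the particular filter/aggregation are illustrative only, not my actual report logic:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("report-sketch")
      .getOrCreate()
    import spark.implicits._

    // 1-10M rows, each with five Long ids and a short text field
    val records = spark.read.parquet("/data/records") // illustrative input path

    val report = records
      .filter($"id1" > 0)        // example filter
      .groupBy($"id2")           // example grouping key
      .agg(
        count("*").as("rowCount"),
        avg(length($"text")).as("avgTextLength")
      )

    report.show()
    spark.stop()
  }
}
```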

I'm aware the question is low on specifics, but does it jump off the page that I'm doing lots of silly things in Spark? Or is Spark job orchestration likely to add that kind of overhead, meaning I would be better off only using Spark on larger datasets?

Thanks

Upvotes: 0

Views: 450

Answers (1)

brent

Reputation: 1105

The most informative piece of the docs I came across was this line from the Spark tuning guide (https://spark.apache.org/docs/2.1.0/tuning.html):

"Spark can efficiently support tasks as short as 200 ms"
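One practical way to act on that guidance for a dataset this small is to keep the number of tasks low, so each task stays comfortably above that roughly 200 ms floor rather than being dominated by scheduling overhead. The sketch below is only illustrative - the partition count and input path are assumptions, not measured or tuned values:

```scala
import org.apache.spark.sql.SparkSession

object SmallDataTuningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("small-data-tuning-sketch")
      .getOrCreate()

    // The default of 200 shuffle partitions means a few million small rows
    // are split into many tasks far shorter than 200 ms each.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    // Collapse an over-partitioned input before the filter/group/aggregate work
    val records = spark.read.parquet("/data/records").coalesce(8)

    records.createOrReplaceTempView("records")
    spark.sql("SELECT COUNT(*) FROM records").show()

    spark.stop()
  }
}
```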

Upvotes: 0
