Reputation: 1105
Are there any rules of thumb for when the data size is sufficient to offset the overhead that Spark processing requires?
I'm working with between 1 and 10 million records. Each record carries 5 Long ids and a small amount of text (less than 5,000 characters).
The workload is report generation, so filter, group, and aggregate. In most cases the top-level aggregation is over all the records, so at some point in the report generation I don't have a good partition key to work with.
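As a rough illustration, the shape of the job is something like the sketch below; the column names and the filter predicate are placeholders, not my real schema:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object ReportSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("report-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical input: 5 Long ids plus a short text field per record.
    val records = spark.read.parquet("/data/records")

    // Filter, then group on one of the ids and aggregate.
    val perGroup = records
      .filter($"id1" > 0)                            // placeholder predicate
      .groupBy($"id2")
      .agg(count("*").as("rows"),
           countDistinct($"id3").as("distinct_id3"))

    // Top-level aggregation over all records: there is no partition key
    // here, so this collapses to a single overall reduce.
    val overall = records.agg(count("*").as("total_rows"))

    perGroup.show()
    overall.show()

    spark.stop()
  }
}
```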
I'm aware the question is low on specifics, but does it jump off the page that I'm doing lots of silly things in Spark? Or is Spark's job orchestration likely to add that kind of overhead, meaning I'd be better off only using Spark on larger datasets?
Thanks
Upvotes: 0
Views: 450
Reputation: 1105
The most informative piece of documentation I came across was this line from the Spark tuning guide:
"Spark can efficiently support tasks as short as 200 ms"
https://spark.apache.org/docs/2.1.0/tuning.html
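A minimal sketch of acting on that, assuming the reports go through DataFrame shuffles: lower spark.sql.shuffle.partitions from its default of 200 so that each task on a few million rows has enough work to amortise the per-task overhead. The value 8 below is only an illustration; tune it to your data size and cluster.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: keep shuffle tasks coarse enough that per-task scheduling
// overhead (the ~200 ms figure from the tuning guide) stays negligible.
val spark = SparkSession.builder()
  .appName("small-data-reports")
  // Default is 200; for 1-10 million small records a handful of
  // partitions is usually plenty (8 is an assumed example value).
  .config("spark.sql.shuffle.partitions", "8")
  .getOrCreate()
```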
Upvotes: 0