Sadhna Jain

Reputation: 33

How to determine the number of executors to read a delta table?

I have a delta table which is partitioned by multiple keys, one of which is a date truncated to the hour (no minute detail, e.g. Fri, 15 Jul 2022 07).

Now, with data continuously being ingested via batch and streaming ingestion workflows, what would be the best strategy to evaluate the number of executors needed to read all the data from the delta table?

A very naive approach could be to just let Spark autoscale, but we may still need to play with shuffle partitions etc. Looking for hints or best practices around the same. Thanks!
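For reference, something like this is what I mean by "let Spark autoscale" with shuffle-partition tuning (a rough sketch only; the table path and numeric values are placeholders, not my actual setup):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("read-delta-table")
    # Let the cluster manager add/remove executors based on pending tasks.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")    # placeholder value
    .config("spark.dynamicAllocation.maxExecutors", "50")   # placeholder value
    # Shuffle partitions still need tuning (or let AQE coalesce them).
    .config("spark.sql.shuffle.partitions", "200")           # placeholder value
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

# Full read of the table (hypothetical path).
df = spark.read.format("delta").load("/path/to/delta-table")
```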

Upvotes: 0

Views: 532

Answers (1)

Jacek Laskowski

Reputation: 74709

If you want to "read all the data from the delta table", it does not really matter whether the table is partitioned or not, since the query reads everything and hence loads the whole table.


This is the worst possible query - the dreaded full scan. If it's inevitable, just know that this is the kind of query where Spark SQL shines, utilising the full power of a Spark cluster. You've been warned :)


Executors are simply JVM processes with a share of CPU cores and memory. You're probably more interested in the number of CPU cores needed for all the tasks that load the delta table.

I'd start this calculation with the number of files for a given version of the delta table. Files come in different sizes and (I might be wrong here) Spark usually chunks them (I don't want to use the overloaded term partitioned here, but that's what springs to mind) into splits of at most spark.sql.files.maxPartitionBytes, which defaults to 128MB.
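For example (a rough sketch; the table path is a placeholder), DESCRIBE DETAIL reports the file count and total size of the current version of a Delta table:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-file-count").getOrCreate()

# DESCRIBE DETAIL exposes, among other columns, numFiles and sizeInBytes
# for the current version of the table.
detail = (
    spark.sql("DESCRIBE DETAIL delta.`/path/to/delta-table`")
    .select("numFiles", "sizeInBytes")
    .first()
)

print(f"files in current version: {detail['numFiles']}")
print(f"total size (bytes):       {detail['sizeInBytes']}")
```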

The number of splits (128MB blocks by default) across all the files of a given version of the delta table would be the number of tasks. That gives you the number of CPU cores and hence their "containers", i.e. Spark executors (sized to evenly saturate the available physical resources for the best performance).
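Putting it together, a back-of-envelope estimate could look like this (all inputs are assumptions: the table size would come from DESCRIBE DETAIL above, the split size from spark.sql.files.maxPartitionBytes, and the cores per executor from your cluster setup):

```python
import math

size_in_bytes = 512 * 1024**3        # assumed: total size of the table version
max_partition_bytes = 128 * 1024**2  # default for spark.sql.files.maxPartitionBytes
cores_per_executor = 4               # assumed cluster setting

# Roughly one task per split (ignoring spark.sql.files.openCostInBytes and
# the fact that splits never cross file boundaries).
estimated_tasks = math.ceil(size_in_bytes / max_partition_bytes)

# To run all tasks in a single wave you'd need this many executors.
estimated_executors = math.ceil(estimated_tasks / cores_per_executor)

print(f"estimated tasks:     {estimated_tasks}")
print(f"estimated executors: {estimated_executors}")
```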

Upvotes: 1
