How can I control the number of output files from a Spark SQL query?

Question

Creating a table from a Spark SQL SELECT, we end up generating too many files. How can we limit them?

Madhava Carrillo · Accepted Answer

Starting from spark 2.4 you can hint the query to control the output:

INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...

For example, this would generate 5 files:

CREATE TABLE business.clients
AS 
SELECT /*+ REPARTITION(5) */
       client_id,
       country,
       wallet
FROM business.users;

Before Spark 2.4, a way would be to limit the number of partitions for the whole query:

SET spark.sql.shuffle.partitions = 5;

But this would potentially impact the process performance.

Answers (1)