Madhava Carrillo

Reputation: 4348

How can I control the number of output files from a Spark SQL query?

When we create a table from a Spark SQL SELECT, the query ends up writing too many output files. How can we limit the number of files?

Upvotes: 0

Views: 1905

Answers (1)

Madhava Carrillo

Reputation: 4348

Starting with Spark 2.4, you can add a hint to the query to control the number of output partitions, and therefore the number of files written:

INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...

For example, this repartitions the result into 5 partitions, so a non-partitioned table ends up with at most 5 files:

CREATE TABLE business.clients
AS 
SELECT /*+ REPARTITION(5) */
       client_id,
       country,
       wallet
FROM business.users;
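
The COALESCE hint can be used the same way. The following is a minimal sketch, assuming that business.clients already exists with these three columns (the INSERT OVERWRITE target is an assumption for illustration, not part of the original example). COALESCE merges existing partitions without a full shuffle, so it can only reduce the partition count: if the SELECT already produces fewer than 5 partitions, the count stays as it is.

```sql
-- Assumption: business.clients already exists with matching columns.
-- COALESCE(5) merges the result down to at most 5 partitions without a full shuffle.
INSERT OVERWRITE TABLE business.clients
SELECT /*+ COALESCE(5) */
       client_id,
       country,
       wallet
FROM business.users;
```

REPARTITION always shuffles and tends to produce evenly sized files, while COALESCE is cheaper but may leave the files unevenly sized.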

Before Spark 2.4, one option is to limit the number of shuffle partitions for the whole query:

SET spark.sql.shuffle.partitions = 5;

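However, this setting applies to every shuffle in the query (joins, aggregations), so it can hurt performance. It also has no effect on a query that does not shuffle at all.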
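
As a rough sketch of how this could look end to end: the example below adds DISTRIBUTE BY to force a shuffle, since a plain SELECT like the one above has none and would ignore the setting. The choice of country as the distribution column is an assumption for illustration, and the value 200 is Spark's built-in default, which your cluster may override.

```sql
-- Lower the shuffle parallelism for the current session.
SET spark.sql.shuffle.partitions = 5;

-- DISTRIBUTE BY forces a shuffle, so the final stage runs with 5 partitions.
CREATE TABLE business.clients
AS
SELECT client_id,
       country,
       wallet
FROM business.users
DISTRIBUTE BY country;

-- Restore the previous value (200 is the built-in default).
SET spark.sql.shuffle.partitions = 200;
```

Restoring the value afterwards keeps later queries in the same session from running with the reduced parallelism.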

More information: https://issues.apache.org/jira/browse/SPARK-24940

Upvotes: 2
