Reputation: 4348
Creating a table from a Spark SQL SELECT, we end up generating too many files. How can we limit them?
Upvotes: 0
Views: 1905
Reputation: 4348
Starting from spark 2.4 you can hint the query to control the output:
INSERT ... SELECT /*+ COALESCE(numPartitions) */ ...
INSERT ... SELECT /*+ REPARTITION(numPartitions) */ ...
For example, this would generate 5 files:
CREATE TABLE business.clients
AS
SELECT /*+ REPARTITION(5) */
client_id,
country,
wallet
FROM business.users;
Before Spark 2.4, a way would be to limit the number of partitions for the whole query:
SET spark.sql.shuffle.partitions = 5;
But this would potentially impact the process performance.
More information here https://issues.apache.org/jira/browse/SPARK-24940
Upvotes: 2