sparkDabbler

Reputation: 525

Spark-Hive partitioning

The Hive table was created with 4 buckets.

CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets 

The following lines in the spark code insert data into this table

 hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")

and in spark-defaults.conf, the default parallelism is set to 128

spark.default.parallelism=128

The problem is that when the inserts into the Hive table happen, the data ends up in 128 files instead of 4 buckets. spark.default.parallelism cannot be reduced to 4, as that makes the whole system very slow. I have also tried the DataFrame.coalesce method before the write, but that makes the inserts too slow.
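For reference, the coalesce attempt presumably looked something like this (a sketch only; the DataFrame name is taken from the snippet above, and the staging table it is read from is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hourly-suspect-insert")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source of the DataFrame used in the question.
    val hourlies = spark.table("hourly_suspect_staging")

    // coalesce(4) collapses the write to 4 tasks, which matches the bucket
    // count but serialises most of the work, hence the slow inserts.
    hourlies
      .coalesce(4)
      .write
      .insertInto("hourly_suspect")   // partitioning is taken from the table definition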

Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?

Upvotes: 1

Views: 1098

Answers (1)

Abhishek Madav

Reputation: 31

As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables through spark-sql. When a bucketed table is created, there should be a CLUSTERED BY clause on one of the columns from the table schema; I don't see that in the CREATE TABLE statement you posted. Assuming such a column does exist and you know which one it is, you could use the .bucketBy(numBuckets, colName) method of the DataFrameWriter API.

More details for Spark 2.0+: [DataFrameWriter (Spark 2.0.0 Javadoc)](https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html)
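For illustration, a minimal sketch of the bucketBy route (the clustering column `cells` is only an assumption, the staging table name is hypothetical, and note that saveAsTable records Spark's own bucketing metadata, which Hive does not treat as Hive-style buckets):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Source DataFrame, as in the question (hypothetical staging table).
    val hourlies = spark.table("hourly_suspect_staging")

    hourlies.write
      .mode(SaveMode.Append)
      .partitionBy("traffic_date_hour") // partition column from the DDL
      .bucketBy(4, "cells")             // 4 buckets on the assumed clustering column
      .sortBy("cells")                  // optional: sort rows within each bucket
      .saveAsTable("hourly_suspect")    // saveAsTable: insertInto does not support bucketBy in Spark 2.2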

Upvotes: 2
