sparkDabbler

Reputation: 525

Spark-Hive partitioning

The Hive table was created with 4 buckets.

CREATE TABLE IF NOT EXISTS hourlysuspect ( cells int, sms_in int) partitioned by (traffic_date_hour string) stored as ORC into 4 buckets 

The following lines in the spark code insert data into this table

 hourlies.write.partitionBy("traffic_date_hour").insertInto("hourly_suspect")

and in spark-defaults.conf, the default parallelism is set to 128

spark.default.parallelism=128

The problem is that when the inserts into the Hive table happen, the data ends up in 128 files instead of 4 buckets. spark.default.parallelism cannot be reduced to 4, as that makes the whole system very slow. I have also tried the DataFrame.coalesce method before the write, but that makes the inserts too slow.
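For reference, the coalesce attempt presumably looked something like this (a sketch only; the DataFrame name is taken from the snippet above, and the staging table it is read from is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("hourly-suspect-insert")
      .enableHiveSupport()
      .getOrCreate()

    // Hypothetical source of the DataFrame used in the question.
    val hourlies = spark.table("hourly_suspect_staging")

    // coalesce(4) collapses the write to 4 tasks, which matches the bucket
    // count but serialises most of the work, hence the slow inserts.
    hourlies
      .coalesce(4)
      .write
      .insertInto("hourly_suspect")   // partitioning is taken from the table definition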

Is there any other way to force the number of buckets to be 4 when the data is inserted into the table?

Upvotes: 1

Views: 1098

Answers (1)

Abhishek Madav

Reputation: 31

As of today (Spark 2.2.0), Spark does not natively support writing to bucketed Hive tables through spark-sql. When a bucketed table is created, there should be a CLUSTERED BY clause on one of the columns from the table schema; I don't see that in the CREATE TABLE statement you posted. Assuming such a column does exist and you know which one it is, you could use the .bucketBy(numBuckets, colName) method of the DataFrameWriter API.

More details for Spark 2.0+: [DataFrameWriter (Spark 2.0.0 Javadoc)](https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/sql/DataFrameWriter.html)
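For illustration, a minimal sketch of the bucketBy route (the clustering column `cells` is only an assumption, the staging table name is hypothetical, and note that saveAsTable records Spark's own bucketing metadata, which Hive does not treat as Hive-style buckets):

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder()
      .appName("bucketed-write-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Source DataFrame, as in the question (hypothetical staging table).
    val hourlies = spark.table("hourly_suspect_staging")

    hourlies.write
      .mode(SaveMode.Append)
      .partitionBy("traffic_date_hour") // partition column from the DDL
      .bucketBy(4, "cells")             // 4 buckets on the assumed clustering column
      .sortBy("cells")                  // optional: sort rows within each bucket
      .saveAsTable("hourly_suspect")    // saveAsTable: insertInto does not support bucketBy in Spark 2.2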

Upvotes: 2
