Sumit D

Reputation: 171

Hive bucketing through Spark SQL

I have a doubt regarding bucketing in Hive. I have created a temporary table which is bucketed on the column key.

Through Spark SQL I am inserting data into this temporary table. I have set hive.enforce.bucketing to true in the Spark session.
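Roughly, this is what I am doing (the table and view names below are just placeholders for my actual ones):

import org.apache.spark.sql.SparkSession

//Hive-enabled session; bucketing enforcement turned on for this session
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.sql("SET hive.enforce.bucketing=true")

//insert into the bucketed temporary table from a source view
spark.sql("INSERT INTO bucketed_temp_table SELECT * FROM source_view")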

When I check the base directory for this table, the file names are prefixed with part_*.

However, when I insert the data into this table manually from another table, I see files prefixed with 00000_*.

I am not sure if Spark SQL is writing the data in buckets.

Can someone please help?

Thanks,

Upvotes: 6

Views: 5844

Answers (2)

jmng

Reputation: 2568

While Spark (in versions <= 2.4, at least) doesn't directly support Hive's bucketing format, it is possible to get Spark to output bucketed data that is readable by Hive, by using Spark SQL to load the data into a Hive table:

import org.apache.spark.sql.SparkSession

//enable Hive support when creating/configuring the spark session
val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
import spark.implicits._

//register DF as a view that can be used with Spark SQL
val testDF = Seq((1, "a"), (2, "b"), (3, "c")).toDF("number", "letter")
testDF.createOrReplaceTempView("testDF")

//create the Hive table; this can also be done manually, e.g. via the Hive CLI
val createTableSQL = "CREATE TABLE testTable (number int, letter string) CLUSTERED BY (number) INTO 1 BUCKETS STORED AS PARQUET"
spark.sql(createTableSQL)

//load data from the DF into Hive; the output parquet files will be bucketed and readable by Hive
spark.sql("INSERT INTO testTable SELECT * FROM testDF")

Upvotes: 1

Ged

Reputation: 18003

A confusing area.

I found this some time ago:

However, Hive bucketed tables are supported from Spark 2.3 onwards. Spark normally disallows users from writing output to Hive bucketed tables. Setting hive.enforce.bucketing=false and hive.enforce.sorting=false will allow you to save to Hive bucketed tables.
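For illustration, those two flags can be set on the session before writing (a sketch only; the table and view names are placeholders):

//allow Spark to write into a Hive bucketed table, without Hive's bucketing guarantees
spark.sql("SET hive.enforce.bucketing=false")
spark.sql("SET hive.enforce.sorting=false")
spark.sql("INSERT INTO bucketed_table SELECT * FROM source_view")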

In Spark's JIRA: https://issues.apache.org/jira/browse/SPARK-17729

Hive allows inserting data into a bucketed table without guaranteeing bucketed-ness and sorted-ness, based on these two configs: hive.enforce.bucketing and hive.enforce.sorting.

With this JIRA, Spark still won't produce bucketed data as per Hive's bucketing guarantees, but will allow writes if and only if the user wishes to do so without caring about bucketing guarantees. The ability to create bucketed tables will enable adding test cases to Spark while pieces are being added to Spark to have it support Hive bucketing (e.g. https://github.com/apache/spark/pull/15229).

But from the definitive source, https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html#unsupported-hive-functionality, comes the following:

Unsupported Hive Functionality

Below is a list of Hive features that we don’t support yet. Most of these features are rarely used in Hive deployments.

Major Hive Features

Tables with buckets: bucket is the hash partitioning within a Hive table partition. Spark SQL doesn’t support buckets yet.

So, to answer your question: you are getting the Spark approach to Hive bucketing, which is an approximation and thus not really the same thing.
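One rough way to see the difference in practice (a sketch; the warehouse path is an assumption that depends on your metastore configuration) is to list the files Spark wrote for the table and compare their names against the 00000_* files that a manual Hive insert produces:

import org.apache.hadoop.fs.{FileSystem, Path}

//list the table's data files to compare naming conventions (part_* vs 00000_*)
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.listStatus(new Path("/user/hive/warehouse/testtable"))
  .foreach(status => println(status.getPath.getName))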

Upvotes: 8
