Reputation: 568
I need to create a Hive table from Spark SQL in PARQUET format with SNAPPY compression. The following code creates the table in PARQUET format, but with GZIP compression:
hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='SNAPPY') as select * from OLD_TABLE")
But in Hue, under "Metastore Tables" -> TABLE -> "Properties", it still shows:
| Parameter           | Value  |
| ------------------- | ------ |
| parquet.compression | SNAPPY |
If I change SNAPPY to any other string, e.g. ABCDE, the code still runs fine, except that the compression is still GZIP:
hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE")
And Hue "Metastore Tables" -> TABLE -> "Properties" shows:
| Parameter           | Value |
| ------------------- | ----- |
| parquet.compression | ABCDE |
This makes me think that TBLPROPERTIES are just ignored by Spark SQL.
Note: I tried running the same query directly in Hive. When the property was set to SNAPPY, the table was created successfully with the proper compression (i.e. SNAPPY, not GZIP).
create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE
When the property was ABCDE, the query didn't fail, but the table wasn't created either.
The question is: what is the problem?
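As a side note, one way to check which codec the written files actually use is to read a Parquet footer directly with parquet-mr (which ships with Spark); a rough sketch, where the file path is illustrative:
// Sketch: print the codec recorded in one Parquet part file's footer.
// Point the path at one of the table's part files; the one below is illustrative.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import scala.collection.JavaConverters._

val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/user/hive/warehouse/new_table/part-00000.parquet"))
footer.getBlocks.asScala.foreach { block =>
  block.getColumns.asScala.foreach(col => println(s"${col.getPath}: ${col.getCodec}"))
}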
Upvotes: 4
Views: 4094
Reputation: 5782
This is the combo that worked for me (Spark 2.1.0):
spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
spark.sql("CREATE TABLE test_table USING PARQUET PARTITIONED BY (date) AS SELECT * FROM test_temp_table")
Verified in HDFS:
/user/hive/warehouse/test_table/date=2017-05-14/part-00000-uid.gz.parquet
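If SNAPPY is the target (as in the question), the same pattern should work with the codec swapped; a minimal sketch, with hypothetical table names:
// Same approach as above, but asking for Snappy; table names are hypothetical.
spark.sql("SET spark.sql.parquet.compression.codec=snappy")
spark.sql("CREATE TABLE test_table_snappy USING PARQUET AS SELECT * FROM test_temp_table")
The part files written under the table directory should then end in .snappy.parquet rather than .gz.parquet.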
Upvotes: 4
Reputation: 9067
Straight from the Spark documentation:
When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance.
Then just below, you will find some properties that control whether Spark enforces all Hive options (and performance...), i.e. spark.sql.hive.convertMetastoreParquet, and how to handle raw reads/writes on Parquet files, such as spark.sql.parquet.compression.codec (gzip by default - you should not be surprised) or spark.sql.parquet.int96AsTimestamp.
Anyway, the "default compression" properties are just indicative. Within the same table and directory, each Parquet file may have its own compression settings -- and page size, HDFS block size, etc.
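Concretely, a sketch of setting those two options at session level (SparkSession API, Spark 2.x; the keys are the ones mentioned above) before running the CTAS:
// A sketch, assuming a SparkSession named `spark`.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")  // let Spark use its own Parquet support
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")   // codec Spark applies when it writes Parquet
// ...then run the CREATE TABLE ... AS SELECT as before.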
Upvotes: 2