Andrey Dmitriev

Reputation: 568

Spark SQL ignores parquet.compression property specified in TBLPROPERTIES

I need to create a Hive table from Spark SQL in PARQUET format with SNAPPY compression. The following code creates the table in PARQUET format, but with GZIP compression:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='SNAPPY') as select * from OLD_TABLE")

But in Hue, under "Metastore Tables" -> TABLE -> "Properties", it still shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  SNAPPY  |

If I change SNAPPY to any other string, e.g. ABCDE, the code still runs fine, except that the compression is still GZIP:

hiveContext.sql("create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE")

And Hue "Metastore Tables" -> TABLE -> "Properties" shows:

|  Parameter            |  Value   |
| ================================ |
|  parquet.compression  |  ABCDE   |

This makes me think that TBLPROPERTIES are simply ignored by Spark SQL.

Note: I tried to run the same query directly in Hive, and when the property was equal to SNAPPY the table was created successfully with the proper compression (i.e. SNAPPY, not GZIP).

create table NEW_TABLE stored as parquet tblproperties ('parquet.compression'='ABCDE') as select * from OLD_TABLE

When the property was ABCDE, the query didn't fail, but the table wasn't created.

So the question is: what is the problem?

Upvotes: 4

Views: 4094

Answers (2)

Garren S

Reputation: 5782

This is the combo that worked for me (Spark 2.1.0):

spark.sql("SET spark.sql.parquet.compression.codec=GZIP")
spark.sql("CREATE TABLE test_table USING PARQUET PARTITIONED BY (date) AS SELECT * FROM test_temp_table")

Verified in HDFS:

/user/hive/warehouse/test_table/date=2017-05-14/part-00000-uid.gz.parquet
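For SNAPPY, which is what the question asks for, presumably the same approach works with the codec swapped; a minimal, untested sketch (test_table_snappy is a hypothetical table name):

spark.sql("SET spark.sql.parquet.compression.codec=snappy")
spark.sql("CREATE TABLE test_table_snappy USING PARQUET PARTITIONED BY (date) AS SELECT * FROM test_temp_table")

The resulting part files should then end in .snappy.parquet rather than .gz.parquet.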

Upvotes: 4

Samson Scharfrichter

Reputation: 9067

Straight from the Spark documentation:

When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for better performance.

Then just below, you will find the properties that control whether Spark enforces all the Hive options (and their performance impact...), i.e. spark.sql.hive.convertMetastoreParquet, as well as the ones governing raw reads/writes of Parquet files, such as spark.sql.parquet.compression.codec (gzip by default, so you should not be surprised) or spark.sql.parquet.int96AsTimestamp.
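For illustration, a minimal sketch of setting these from the question's HiveContext before the CTAS; the config keys are real Spark settings, but whether they take effect for this particular write path depends on the Spark version:

hiveContext.setConf("spark.sql.hive.convertMetastoreParquet", "true")
hiveContext.setConf("spark.sql.parquet.compression.codec", "snappy")
hiveContext.sql("create table NEW_TABLE stored as parquet as select * from OLD_TABLE")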

Anyway, the "default compression" properties are just indicative. Within the same table and directory, each Parquet file may have its own compression settings -- and page size, HDFS block size, etc.

Upvotes: 2
