Tom

Reputation: 6342

How to write data to hive table with snappy compression in Spark SQL

I have an ORC Hive table that was created with the following Hive command:

create table orc1(line string) stored as orcfile

I want to write some data to this table using Spark SQL. I use the following code and want the data to be Snappy-compressed on HDFS:

  test("test spark orc file format with compression") {
    import SESSION.implicits._
    Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
    SESSION.sql("set hive.exec.compress.output=true")
    SESSION.sql("set mapred.output.compress=true")
    SESSION.sql("set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec")
    SESSION.sql("set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec")
    SESSION.sql("set mapred.output.compression.type=BLOCK")
    SESSION.sql("insert overwrite table orc1 select a from tmp  ")
  }

The data is written, but it is NOT compressed with Snappy.

If I run the insert overwrite in Beeline/Hive with the same set commands, then the table's files are compressed with Snappy.

So, how can I write data with Snappy compression from Spark SQL 2.1 to ORC tables that were created by Hive?

Upvotes: 1

Views: 2267

Answers (1)

randal25

Reputation: 1330

You can set the compression to Snappy in the create table command, like so:

create table orc1(line string) stored as orc tblproperties ("orc.compress"="SNAPPY");

Then any inserts into the table will be Snappy-compressed (I also corrected orcfile to orc in the command).
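
For reference, here is a minimal sketch of the full flow from Spark SQL, mirroring the test in the question. It assumes a SparkSession named SESSION with Hive support enabled; the table and column names are the question's own, and you could equally run the create statement from Hive as before:

  test("write snappy-compressed orc via table properties") {
    import SESSION.implicits._

    // Declare the codec on the table itself; Spark SQL can run this DDL
    // when Hive support is enabled (or keep creating the table from Hive).
    SESSION.sql("drop table if exists orc1")
    SESSION.sql(
      """create table orc1(line string)
        |stored as orc
        |tblproperties ("orc.compress"="SNAPPY")""".stripMargin)

    // A plain insert now writes Snappy-compressed ORC files on HDFS;
    // none of the hive.exec.compress.output / mapred.* settings are needed.
    Seq("Hello Spark", "Hello Hadoop").toDF("line").createOrReplaceTempView("tmp")
    SESSION.sql("insert overwrite table orc1 select line from tmp")
  }

Because the codec is stored in the table's properties, it applies to inserts from Spark and from Hive alike, without relying on session-level set commands.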

Upvotes: 1
