Reputation: 6342
I have an ORC Hive table that is created using the Hive command
create table orc1(line string) stored as orcfile
I want to write some data to this table using Spark SQL. I use the following code and want the data to be Snappy-compressed on HDFS:
test("test spark orc file format with compression") {
import SESSION.implicits._
Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
SESSION.sql("set hive.exec.compress.output=true")
SESSION.sql("set mapred.output.compress=true")
SESSION.sql("set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set io.compression.codecs=org.apache.hadoop.io.compress.SnappyCodec")
SESSION.sql("set mapred.output.compression.type=BLOCK")
SESSION.sql("insert overwrite table orc1 select a from tmp ")
}
The data is written, but it is NOT compressed with Snappy.
If I run the same insert overwrite in Beeline/Hive with the above set commands, then the table's files are compressed with Snappy.
So my question is: how do I write Snappy-compressed data with Spark SQL 2.1 to ORC tables that are created by Hive?
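For reference, SESSION in the snippet above is a Hive-enabled SparkSession. A minimal sketch of the assumed setup (the app name is illustrative, not my exact configuration):

import org.apache.spark.sql.SparkSession

// Hive support is required so that Spark SQL can resolve and write
// the Hive-managed table orc1.
val SESSION: SparkSession = SparkSession.builder()
  .appName("orc-compression-test")
  .enableHiveSupport()
  .getOrCreate()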
Upvotes: 1
Views: 2267
Reputation: 1330
You can set the compression to Snappy on the create table command, like so:
create table orc1(line string) stored as orc tblproperties ("orc.compress"="SNAPPY");
Then any inserts into the table will be Snappy compressed (I also corrected orcfile to orc in the command).
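With that property on the table, a plain insert from Spark SQL should come out Snappy compressed without any of the Hadoop/Hive set commands. A minimal sketch, assuming the same Hive-enabled SESSION as in the question and the table recreated as above:

import SESSION.implicits._

// No compression settings are needed on the Spark side; the ORC writer
// takes its codec from the table property orc.compress=SNAPPY.
Seq("Hello Spark", "Hello Hadoop").toDF("a").createOrReplaceTempView("tmp")
SESSION.sql("insert overwrite table orc1 select a from tmp")

You can then confirm the codec on one of the written files with the ORC dump utility, for example hive --orcfiledump <hdfs-path-to-a-table-file>, which should report Compression: SNAPPY.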
Upvotes: 1