Reputation: 395
I'm trying to create a snappy.parquet file from a Hive table. It's a big partitioned table and I only need a small part of it. Doing this:
set parquet.compression=SNAPPY;
set hive.exec.compress.output=true;
set hive.exec.compress.intermediate=true;
set hive.exec.parallel=true;
set mapred.output.compress=true;
set mapreduce.output.fileoutputformat.compress=true;
set mapred.compress.map.output=true;
set mapreduce.map.output.compress=true;
set mapred.output.compression.type=BLOCK;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
set io.seqfile.compression.type=BLOCK;
insert overwrite directory 'EXTERNAL_DIRECTORY' STORED AS PARQUET select * from SOURCE_TABLE;
It creates a 000000_0 file with the following schema:
message hive_schema {
optional int32 _col0;
optional binary _col1 (UTF8);
optional binary _col2 (UTF8);
optional binary _col3 (UTF8);
optional binary _col4 (UTF8);
optional binary _col5 (UTF8);
optional binary _col6 (UTF8);
optional binary _col7 (UTF8);
optional int64 _col8;
optional int64 _col9;
optional int64 _col10;
}
This loses all the column names from SOURCE_TABLE. How do I save it properly so I can use it as a Hive table later?
Upvotes: 3
Views: 7299
Reputation: 2954
I would create a new external table for your data set by selecting all the data from the source partitions you are after. Then you would have a table and files that can be reused. You can't do a CREATE TABLE AS SELECT statement with external tables as of now, so you need to create the table first and then load data into it:
create external table yourNewTable ( use your source table DDL... )
stored as parquet
location '/yourNewLocation';

insert into yourNewTable
select * from yourSourceTable
where yourPartitionedFieldName = 'desiredPartitionName';
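To also keep the Snappy compression from the question, the Parquet codec can be declared on the table itself instead of via session settings. A minimal sketch, where the table name, column list, and partition field are hypothetical stand-ins for your actual DDL:

```sql
-- Hypothetical columns: replace with the real DDL of SOURCE_TABLE.
CREATE EXTERNAL TABLE your_new_table (
  id     INT,
  name   STRING,
  amount BIGINT
)
STORED AS PARQUET
LOCATION '/yourNewLocation'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

-- Writing into a real table (not a bare directory) means the Parquet
-- footer records the table's column names instead of _col0, _col1, ...
INSERT INTO your_new_table
SELECT id, name, amount
FROM yourSourceTable
WHERE yourPartitionedFieldName = 'desiredPartitionName';
```

Because the target is a defined table rather than an `insert overwrite directory` destination, Hive writes the declared column names into the Parquet schema, which is why the `_colN` placeholders disappear.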
Upvotes: 1