Reputation: 21
I have a partitioned ORC table in Hive. After loading the table with all possible partitions, I end up with multiple ORC files on HDFS, i.e. each partition directory on HDFS has an ORC file in it. I need to combine all the ORC files under each partition into a single big ORC file for a particular use case.
Can someone suggest a way to combine these multiple ORC files (one per partition) into a single big ORC file?
I've tried creating a new non-partitioned ORC table from the partitioned table. It does reduce the number of files, but not to a single file.
PS: Creating a table out of another one is entirely a map task, so setting the number of reducers to 1 with the property 'set mapred.reduce.tasks=1;' doesn't help.
Thanks
Upvotes: 1
Views: 1978
Reputation: 1584
You can use the CONCATENATE command to combine the small ORC files. This can be done at the table level as well as at the partition level.
The syntax, as per the ORC documentation:
users can request an efficient merge of small ORC files together by issuing a CONCATENATE command on their table or partition. The files will be merged at the stripe level without reserialization.
ALTER TABLE istari [PARTITION partition_spec] CONCATENATE;
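As a rough sketch of how the syntax above might be used on a partitioned table: the table name istari comes from the documentation snippet, while the partition column dt and its value are purely hypothetical and need to be replaced with your own partition spec. Note that CONCATENATE merges files at the stripe level and may not always leave exactly one file, so it is worth checking the partition directory afterwards.

```sql
-- List the partitions so you know which partition specs exist.
SHOW PARTITIONS istari;

-- Merge the small ORC files inside one partition.
-- 'dt' and '2015-01-01' are hypothetical; substitute your own partition column and value.
ALTER TABLE istari PARTITION (dt='2015-01-01') CONCATENATE;

-- For a non-partitioned ORC table, omit the PARTITION clause.
ALTER TABLE istari CONCATENATE;
```

If the partition still contains more than one file after the first run, running the same statement again on that partition may merge the files further.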
Upvotes: 0