Despicable me

Reputation: 700

Increase write speed in Hive for ORC files

Currently, an insert overwrite table T1 select * from T2; takes around 100 minutes on my cluster. Table T1 is stored as ORC and T2 as text. I am reading 60 GB of text data from T2 and inserting it into the ORC table T1 (10 GB after insertion). If I use text format for both tables, the insert takes around 50 minutes. In both cases, what can we do to improve write speed (I have large tables coming in)? Any other suggestions are welcome.

Upvotes: 3

Views: 2703

Answers (2)

A.r. Naresh

Reputation: 41

I recently worked out an approach that splits the source file into smaller files (file pieces, which the steps below call partitions). With it, loading 100 GB of data from a text table into an ORC table in Hive takes around 6 minutes. The approach is below.

Before inserting the file into text table

1. Split the file into smaller pieces in its Unix location using the split command.

2. Then remove the original file from that path and keep only the split files.
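Steps 1 and 2 could look roughly like the sketch below. It is only an illustration: the file names, the 1000 generated rows, and the 50-lines-per-piece setting are made up, and it assumes one record per line so that splitting on line boundaries never cuts a row in half.

```shell
#!/bin/sh
# Hypothetical paths and sizes, for illustration only.
set -eu

workdir=$(mktemp -d)
src="$workdir/data.txt"

# Stand-in for the real source file: 1000 single-line records.
i=1
while [ "$i" -le 1000 ]; do
  echo "row$i,value$i"
  i=$((i + 1))
done > "$src"

# Step 1: split on line boundaries (-l) so no record is cut in half.
# 50 lines per piece gives 20 pieces named part_aa, part_ab, ...
mkdir "$workdir/parts"
split -l 50 "$src" "$workdir/parts/part_"

# Step 2: remove the original so only the pieces remain.
rm "$src"

# Sanity check: count the pieces and confirm no rows were lost.
piece_count=$(ls "$workdir/parts" | wc -l | tr -d ' ')
row_count=$(cat "$workdir/parts"/part_* | wc -l | tr -d ' ')
echo "pieces=$piece_count rows=$row_count"
```

Splitting by line count rather than by byte size (-b) is the important choice here, since a byte-based split can break a row across two files. For a real file, the lines-per-piece value would be picked so that the number of pieces roughly matches the parallelism you want on the cluster.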

Inserting into text table

3. Now load the data into the text table.

4. The load will take a few minutes, and you will see the same number of files in the table as you created at the Unix level.

Inserting into orc table

5. For example, if you split the original file into, say, 20 pieces, you will see 20 tasks/containers running on the cluster to load the ORC table, which is much faster than the other solutions I came across.

@despicable-me

Upvotes: 2

Alex

Reputation: 8937

That is probably normal behaviour: when you write data from text to text, it just copies data line by line from one file into another, while text-to-ORC does more work on top of that. Compared with the text-to-text operation, the text-to-ORC import performs additional bucket-partition and compression operations on your data. That is the reason for the extra time. The ORC format gives two main benefits over the text format:

  • space savings due to compression
  • faster access when working with the data

Usually the INSERT is a one-time operation, while access operations are very frequent. So it usually makes sense to spend some more time up front importing the data and then get a large benefit from both the space saved by the optimized storage and the faster access to the data.

Upvotes: 1
