Reputation: 700
Currently an insert overwrite table T1 select * from T2;
will take around 100 minutes in my cluster. Table T1 is ORC
formatted and T2 is text
formatted. I am reading a 60 GB of text data from T2 and inserting into ORC table T1(10 GB after insertion). If i use text format for both tables insert will take around 50 min. In both cases what are the things we can do to improve write speed( I have large tables coming in) or any other suggestions??
Upvotes: 3
Views: 2703
Reputation: 41
I have recently derived an approach which splits the source file into partitions this takes around 6mins from text table to orc table in hive for 100GB data. Approach below
Before inserting the file into text table
1.split the file into small partitions in unix location using split command
2.then remove the original file from the path and just keep the files splitted.
Inserting into text table
3.now load the data into text table
4.it will take some mins to load and u can see that there will be same number of partitions as you have done at unix level
Inserting into orc table
@despicable-me
Upvotes: 2
Reputation: 8937
That is probably a normal behaviour as when you write data from text to text - it just writes data line by line from one file into another. Text-to-ORC will do some more work besides of it. Comparing to the text-to-text operation, text-to-orc importing will perform additional bucket-partition operations and compression operations to you data. That is the resaon of your time impacts. ORC format gives two main benefits upon text format:
Usually the INSERT operation is a single time operation, while access operations will be very frequent. So it usually makes sence to spend some more time at the beginning on importing the data and then have a huge benefite in saving space due to optimized storage of the data and in optimized access time to this data
Upvotes: 1