Reputation: 394
I have one Hive internal table which has around 500 million records. My Hive is deployed on top of AWS EMR, and I do not want to keep the EMR cluster running all the time. Hence I want to back up the internal table's data.
One easy way of doing it is to create an external table pointing to an S3 location and then move all the records into that external table with an insert. Whenever I need the internal table back, I can use this external S3 table to get all the data back. For example:
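A rough sketch of what I mean (table name, columns and bucket are placeholders, and ORC is only there as an example, since the format is exactly what I'm asking about):

    -- Backup target: external table on S3 (hypothetical schema and bucket)
    CREATE EXTERNAL TABLE my_table_backup (
      id       BIGINT,
      name     STRING,
      event_ts TIMESTAMP
    )
    STORED AS ORC
    LOCATION 's3://my-backup-bucket/my_table_backup/';

    -- Copy the data out of the internal table
    INSERT OVERWRITE TABLE my_table_backup
    SELECT * FROM my_table;

    -- Later, restore by running the insert in the other direction
    INSERT OVERWRITE TABLE my_table
    SELECT * FROM my_table_backup;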
Since this table's only purpose is backup, I want to ask which STORED AS format will be the best choice for me.
Hive, as of now, supports the following formats:
TEXTFILE
SEQUENCEFILE
ORC
PARQUET
AVRO
RCFILE
Also, is there any other way to back up internal tables besides the approach mentioned above?
Upvotes: 1
Views: 208
Reputation: 35424
I'd think changing the file format (among the ones you listed) will not, by itself, make much difference in size. But the file size and the type of access you want on that data play a crucial role in your cloud account billing.
So consider the following when choosing a solution: how much time you can afford for writing the backup and reading it back, versus how much storage space you are willing to pay for.
Compression formats differ in their decompression speed and space efficiency, so pick the one that is available to you and balanced (in time versus space, as per the considerations above).
More compression and decompression benchmarks at
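Once you've picked a codec, here's a sketch of how you could apply it to the backup table (placeholder names; 'orc.compress' and 'parquet.compression' are the usual table properties for choosing the codec):

    -- ORC backup table with ZLIB (better space) or SNAPPY (faster decompression)
    CREATE EXTERNAL TABLE my_table_backup (
      id   BIGINT,
      name STRING
    )
    STORED AS ORC
    LOCATION 's3://my-backup-bucket/my_table_backup/'
    TBLPROPERTIES ('orc.compress' = 'ZLIB');

    -- A Parquet equivalent would use TBLPROPERTIES ('parquet.compression' = 'SNAPPY')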
Upvotes: 1