Harbeer Kadian

Reputation: 394

Best storage format to backup hive internal table

I have a Hive internal table with around 500 million records. Hive is deployed on top of AWS EMR. I do not want to keep the EMR cluster running at all times, so I want to back up the internal table's data.

One easy way of doing this is to create an external table pointing to an S3 location and then move all records into that external table with an INSERT statement. Whenever I need the internal table back, I can use this external S3 table to restore all the data.
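A minimal HiveQL sketch of that approach (the table names, columns, and bucket are hypothetical placeholders):

```sql
-- Hypothetical schema and bucket; adjust to your own table.
-- An external table backed by S3: dropping it leaves the files in place.
CREATE EXTERNAL TABLE my_table_backup (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION 's3://my-backup-bucket/my_table_backup/';

-- Copy everything from the internal table into the backup.
INSERT OVERWRITE TABLE my_table_backup
SELECT * FROM my_table;

-- Later, to restore:
-- INSERT OVERWRITE TABLE my_table SELECT * FROM my_table_backup;
```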

Since this table's only purpose is backup, I want to ask which STORED AS format would be the best choice for me.

Hive currently supports the following formats:

TEXTFILE
SEQUENCEFILE
ORC
PARQUET
AVRO
RCFILE

Also, is there any other way to back up internal tables besides the approach mentioned above?

Upvotes: 1

Views: 208

Answers (1)

mrsrinivas

Reputation: 35424

In Short

I'd expect changing the file format (among those you listed) to make little difference in size. However, file size and the type of access you need on that data both play a crucial role in your cloud bill.

So consider the following:

  1. Compression - to reduce the size.
  2. Amazon Glacier - more cost-effective than S3 on AWS, since backup data is rarely accessed (archival).
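For the compression point, output compression can be turned on in Hive before writing the backup table. A sketch (the table names are hypothetical; the codec shown is the standard Hadoop Gzip codec):

```sql
-- Compress the files that the INSERT writes out.
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;

INSERT OVERWRITE TABLE my_table_backup
SELECT * FROM my_table;
```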

Things to consider when choosing a solution: how much time you can afford

  • to retrieve the file from archival storage;
  • to convert the data back into a Hive-managed table format (if you change the format during archival);
  • to decompress the data (each compression codec is a trade-off between time and size).

Extended answer

Here are some file formats with their decompression speed and space efficiency. Pick the compression format that is available to you and balanced (in time versus space, per the questions above).
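For example, with ORC the codec can be chosen per table via a table property; a sketch with hypothetical names (ZLIB gives smaller files but slower access, SNAPPY gives larger files but faster decompression):

```sql
-- Choose the ORC codec at table-creation time.
CREATE EXTERNAL TABLE my_table_backup_zlib (
  id      BIGINT,
  payload STRING
)
STORED AS ORC
LOCATION 's3://my-backup-bucket/zlib/'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```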

[chart: decompression speed vs. space efficiency for common compression codecs]

More compression and decompression benchmarks at:

Upvotes: 1
