VikasG

Reputation: 579

Data size increases in HBase

I am trying to import data from MySQL to HBase using sqoop. There are about 9 million records in the MySQL table, size being nearly 1.2GB. The replication factor of the hadoop cluster is three.
Here are the issues I am facing:

  1. The data size after import to HBase is more than 20 GB! Ideally it should be closer to, say, 5 GB (1.2 GB * 3 + some overhead).

  2. The HBase table has VERSIONS set to 1. If I import the same table again from MySQL, the file size under /hbase/ increases (almost doubles), although the row count in the HBase table stays the same. This seems odd, since I am inserting the same rows into HBase; the file size should remain constant, just like the row count.

As far as my understanding goes, the file size in the second case shouldn't increase when I import the same row set, since at most one version should be maintained for each entry.

Any help would be highly appreciated.

Upvotes: 0

Views: 2038

Answers (2)

Arnon Rotem-Gal-Oz

Reputation: 25909

The "some overhead" can get quite big in HBase as each value also stores the key, the family , the qualifier, the timestamp, the version and the value itself - you should strive to make the key, family and qualifier as short as possible.

Additionally, you may want to use compression - Snappy is a good option (you can see this post for a comparison between compression codecs).
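To put both suggestions together, a table can be created with a one-character family name and Snappy compression enabled from the Java client. This is only a minimal sketch, assuming a 0.94-era HBase client API; the table name "mytable" and the one-letter family "d" are hypothetical, so adjust them to your schema:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateCompressedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // A short family name keeps the per-KeyValue overhead small,
            // since the family is repeated in every stored cell.
            HTableDescriptor table = new HTableDescriptor("mytable");
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setMaxVersions(1);                                // one version, as in the question
            family.setCompressionType(Compression.Algorithm.SNAPPY); // compress the HFiles with Snappy
            table.addFamily(family);

            admin.createTable(table);
            admin.close();
        }
    }

The shell equivalent would be along the lines of create 'mytable', {NAME => 'd', VERSIONS => 1, COMPRESSION => 'SNAPPY'}; either way, Snappy has to be installed and available to the region servers.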

Regarding your second question: when you copy the table again you get another copy of each value, and the older versions will only be cleared after compaction. This is because HBase stores its data in Hadoop (HDFS), so once written the files are read-only; compaction creates new files that contain only the needed data and deletes the unneeded data/files.
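If you don't want to wait for HBase to schedule a major compaction, you can request one yourself after the second import. A minimal sketch, again assuming the 0.94-era Java client API and the hypothetical table name "mytable" from above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class CompactTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            // Ask for a major compaction of the table. Duplicate versions and
            // deleted cells are dropped when the new HFiles are written.
            // The request is asynchronous, so space is reclaimed only once
            // the compaction has actually run on the region servers.
            admin.majorCompact("mytable");

            admin.close();
        }
    }

The same thing can be done from the HBase shell with major_compact 'mytable'.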

Upvotes: 1

Woot4Moo

Reputation: 24316

It depends. According to this blog:

So, to calculate the record size: Fixed part needed by KeyValue format = Key Length + Value Length + Row Length + CF Length + Timestamp + Key Type = (4 + 4 + 2 + 1 + 8 + 1) = 20 Bytes

Variable part needed by KeyValue format = Row + Column Family + Column Qualifier + Value

Total bytes required = Fixed part + Variable part

So for the above example, let's calculate the record size:

First Column = 20 + (4 + 4 + 10 + 3) = 41 Bytes
Second Column = 20 + (4 + 4 + 9 + 3) = 40 Bytes
Third Column = 20 + (4 + 4 + 8 + 6) = 42 Bytes

Total size for row1 in the above example = 123 Bytes

To store 1 billion such records, the space required = 123 * 1 billion = ~123 GB
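To make the arithmetic easy to check, here is a small sketch that applies the same formula. The class and method names are hypothetical, and the field lengths are simply the ones from the blog's example (a 4-byte row key, a 4-byte family, and the three qualifier/value pairs above):

    public class KeyValueSize {
        // Fixed per-KeyValue overhead: key length (4) + value length (4) +
        // row length (2) + CF length (1) + timestamp (8) + key type (1) = 20 bytes.
        static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

        // Variable part = row + column family + column qualifier + value (all lengths in bytes).
        static int cellSize(int rowLen, int familyLen, int qualifierLen, int valueLen) {
            return FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
        }

        public static void main(String[] args) {
            int first  = cellSize(4, 4, 10, 3);  // 41 bytes
            int second = cellSize(4, 4, 9, 3);   // 40 bytes
            int third  = cellSize(4, 4, 8, 6);   // 42 bytes

            long rowTotal = first + second + third;   // 123 bytes per row
            long billion  = rowTotal * 1000000000L;   // ~123 GB for 1 billion rows

            System.out.println("One row:        " + rowTotal + " bytes");
            System.out.println("1 billion rows: " + billion + " bytes (~123 GB)");
        }
    }

Note that this is the raw KeyValue size, before HFile block overhead, indexes and compression, so the actual on-disk footprint will differ.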

I would presume your calculations are grossly incorrect; perhaps share your schema design with us and we can work out the math.

Upvotes: 3
