sabrina

Reputation: 111

HBase table size is much bigger than the file in Hadoop HDFS

Recently I used Hadoop bulk load to put data into HBase. First, I called the HDFS API to write the data into a file in HDFS: 7,000,000 lines in total, 503 MB. Second, I used org.apache.hadoop.hbase.mapreduce.ImportTsv and org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles to put the data into HBase.

The key step was using the bulk load tool to put the data into HBase. After the bulk load finished, I found that the HBase table is 1.96 GB, almost four times the size of the input file. The HDFS replication factor is 1, so replication does not explain it. I do not know why.
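For reference, the commands I ran were roughly of this shape (the table name mytable, the column mapping, and the paths below are placeholders, not my exact values):

```
# Step 1: run ImportTsv in bulk-output mode, so it writes HFiles
# instead of going through the normal HBase write path.
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.columns=HBASE_ROW_KEY,colfam1:c1 \
    -Dimporttsv.bulk.output=/tmp/hfile-output \
    mytable /user/sabrina/input.tsv

# Step 2: move the generated HFiles into the regions of the target table.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
    /tmp/hfile-output mytable
```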

Upvotes: 3

Views: 3503

Answers (1)

Donald Miner

Reputation: 39943

There is a bit of overhead in storing the data, since every cell has to store the row key, column family, and column qualifier names and such, but not 4x overhead. I have a few ideas, but I definitely wouldn't mind hearing more details on the nature of the data and perhaps the stats on the table.

  • Do you have compression turned on in your table? If the data was compressed in HDFS but is stored uncompressed in HBase after the load, that could account for a lot of the difference (see the commands after this list for how to check).
  • Maybe HBase for whatever reason isn't honoring your replication factor. Run hadoop fs -dus /path/to/hbase/table/data and see what that returns (also shown below).
  • Are your column qualifiers pretty big? For example, colfam1:abc is pretty small and won't take up much space, but colfam1:abcdefghijklmnopqrstuvwxyz is going to take up quite a bit of space in the grand scheme of things!
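A minimal sketch of how to run those checks, assuming the table is called mytable with a column family colfam1 and the default HBase root directory (all of these are placeholders; adjust to your setup):

```
# Check the actual on-disk size of the table directory (summarized).
# /hbase is the default hbase.rootdir; yours may differ.
hadoop fs -dus /hbase/mytable

# Check whether compression is enabled for each column family.
echo "describe 'mytable'" | hbase shell

# If COMPRESSION => 'NONE', enable it and rewrite the store files.
# (Older HBase versions require the table to be disabled before alter.)
echo "disable 'mytable'" | hbase shell
echo "alter 'mytable', {NAME => 'colfam1', COMPRESSION => 'GZ'}" | hbase shell
echo "enable 'mytable'" | hbase shell
echo "major_compact 'mytable'" | hbase shell
```

If the compressed, major-compacted size is still far above 503 MB, the per-cell key overhead from the last bullet (long row keys or qualifiers repeated in every cell) is the likely remaining culprit.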

Upvotes: 3
