Reputation: 137
I am just starting to use HBase and HDFS to store my data. As a first experiment, I imported the data from a 12GB text file (uncompressed) into an HBase table named 'test', with HDFS replication set to 3.
To my surprise, after the import the NameNode report (port 50070) showed "DFS Used: 390.93GB". I only imported 12GB of data, yet HBase consumed roughly 390GB of space, which makes no sense to me. Can anybody shed some light on how I can troubleshoot this issue?
Upvotes: 0
Views: 151
Reputation: 116
The first thing that comes to mind is how you store your data in your column families. HBase writes every cell with its full key (row key, column-family name, column qualifier, timestamp, etc.), so if you read a row of data from your file and store it into a column family as N columns whose qualifier names have lengths M(i), then at a minimum you incur an overhead of SUM( M(i) ) for 1 <= i <= N, plus the per-cell headers, timestamps, etc., for every single row. Give your columns (and column family) short names if you want to save some space.
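To make that concrete, here is a minimal sketch using the HBase client's KeyValue class (the row, family, and qualifier names are made up for illustration); it prints the serialized size of one cell with long names versus short names, so you can see how much of each stored byte is key overhead rather than your value:

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

// Compares the serialized size of a single cell when the column family
// and qualifier names are long vs. short. The value is identical in both.
public class CellOverheadDemo {
    public static void main(String[] args) {
        byte[] row   = Bytes.toBytes("row-0000001");
        byte[] value = Bytes.toBytes("42");

        // Long family/qualifier names: these names are repeated in every cell.
        KeyValue verbose = new KeyValue(row,
                Bytes.toBytes("customer_details"),          // column family (hypothetical)
                Bytes.toBytes("customer_last_login_time"),  // qualifier (hypothetical)
                value);

        // Short names carry the same value with far less per-cell overhead.
        KeyValue compact = new KeyValue(row,
                Bytes.toBytes("d"),   // column family
                Bytes.toBytes("llt"), // qualifier
                value);

        System.out.println("verbose cell bytes: " + verbose.getLength());
        System.out.println("compact cell bytes: " + compact.getLength());
    }
}
```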
Other than that, space is also taken up by the WAL logs, intermediate splits, and data files that have not yet been fully compacted/merged. For example, if you import your data a couple of times (for whatever reason, say an import failed, or you stopped it midway to start over with a better/faster approach), that data also lives on in HFiles until a compaction runs.
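If you want to see where the bytes actually went, something like the following sketch can help. It uses the plain Hadoop FileSystem API to list the top-level directories under the HBase root and print their size before and after replication; note that "/hbase" is an assumption (use whatever hbase.rootdir points to), and the exact sub-directory names for WALs, archives, etc. differ between HBase versions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints, for each directory directly under the HBase root dir, the logical
// data size and the space actually consumed on HDFS (i.e. including replicas).
public class HBaseSpaceReport {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path hbaseRoot = new Path(args.length > 0 ? args[0] : "/hbase");

        for (FileStatus child : fs.listStatus(hbaseRoot)) {
            ContentSummary cs = fs.getContentSummary(child.getPath());
            System.out.printf("%-40s logical=%,d bytes  consumed=%,d bytes%n",
                    child.getPath().getName(),
                    cs.getLength(),          // data size before replication
                    cs.getSpaceConsumed());  // size after replication
        }
    }
}
```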
If you think that column naming is not what is eating your space, then try running a major compaction on your column family / table. Wait until all the compaction tasks have finished running and check your footprint again...
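For reference, the compaction can be requested from the hbase shell with `major_compact 'test'`, or programmatically. The sketch below uses the Admin interface from the newer client API (HBase 1.0+); on older releases the HBaseAdmin class exposes an equivalent majorCompact(String) call. Either way the call only requests the compaction and returns before the region servers finish:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Requests a major compaction of the 'test' table; the region servers
// carry out the compaction asynchronously after this returns.
public class MajorCompactTable {
    public static void main(String[] args) throws Exception {
        try (Connection connection =
                     ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            admin.majorCompact(TableName.valueOf("test"));
        }
    }
}
```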
Hope that provides some insight. Good luck!
Upvotes: 1