Shashwat Mishra

Reputation: 96

Strange behaviour in hbase bulk load

I am trying to bulk load about 20k files into an HBase table. The average file size is 400 KB, but some files are as large as 70 MB, and the total size of all files put together is 11 GB. The approach is the standard one: the mapper emits key-value pairs, followed by a call to LoadIncrementalHFiles.

When I run the code on a random sample of 10 files, everything works, and I noted that the generated HFiles were about 1.3 times the size of the input files. However, when I run the same code over all 20k files, I get HFiles which, put together, are 400 GB in size, 36 times as large as the data itself. HFiles contain indexes and metadata in addition to the table data, but even with that, what could explain such a dramatic increase in size?
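For reference, the driver is roughly the following (a minimal sketch assuming a recent HBase 1.x client API; BulkLoadDriver, ContentMapper, the table name, and the argument paths are placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("content_table");   // placeholder table name

        Job job = Job.getInstance(conf, "hfile-bulkload");
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(ContentMapper.class);                    // the mapper shown in the answer below
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));       // input sequence files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));     // HFile output directory

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {

            // Sorts and partitions the map output so each reducer writes HFiles for one region.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Move the generated HFiles into the table's regions.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path(args[1]), admin, table, locator);
        }
    }
}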

Upvotes: 1

Views: 137

Answers (1)

Shashwat Mishra

Reputation: 96

I discovered the reason behind the dramatic increase in space.

This is what my mapper, which emits the key-value pairs, looked like (the input was a SequenceFile):

public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    ....
    byte[] row = Bytes.toBytes(rowID);
    hKey.set(row);
    // value.getBytes() hands back the writable's entire backing array
    kv = getKV(familyRaw, Bytes.toBytes("content"), value.getBytes());
}

The problem is the call to value.getBytes(). It returns the writable's whole backing byte array, which is padded beyond the valid length (getLength()). Because the SequenceFile reader reuses the same BytesWritable for every record, the backing array grows to fit the largest record seen so far, so every small file was written out with a buffer sized for a much larger one. Changing the call to value.copyBytes(), which copies only the valid bytes, fixed the behaviour.
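Here is a minimal sketch that shows the difference (the class name and sizes are illustrative; it only assumes org.apache.hadoop.io.BytesWritable):

import org.apache.hadoop.io.BytesWritable;

public class BytesWritablePadding {
    public static void main(String[] args) {
        BytesWritable value = new BytesWritable();

        // Simulate the SequenceFile reader reusing one writable:
        // a large record grows the backing array...
        value.set(new byte[70 * 1024 * 1024], 0, 70 * 1024 * 1024);
        // ...then a small record reuses it; the capacity never shrinks.
        value.set(new byte[]{1, 2, 3}, 0, 3);

        System.out.println(value.getLength());        // 3: the valid bytes
        System.out.println(value.getBytes().length);  // on the order of 100 MB: the padded backing array
        System.out.println(value.copyBytes().length); // 3: a copy of just the valid bytes
    }
}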

This is discussed in HADOOP-6298.

Upvotes: 1
