Shashwat Mishra

Reputation: 96

Strange behaviour in hbase bulk load

I am trying to bulk load about 20k files into an HBase table. The average file size is 400 KB, but some files are as large as 70 MB, and the total size of all files put together is 11 GB. The approach is the standard one: the mapper emits key-value pairs, followed by a call to LoadIncrementalHFiles.

When I run the code on a random sample of 10 files, everything works, and I noted that the generated HFiles were about 1.3 times the size of the input files. However, when I run the same code over all 20k files, I get HFiles which, put together, are 400 GB in size, 36 times as large as the data itself. HFiles contain indexes and metadata in addition to the table data, but even with that, what could explain such a dramatic increase in size?
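For reference, the driver is roughly the following (a minimal sketch assuming a recent HBase 1.x client API; BulkLoadDriver, ContentMapper, the table name, and the argument paths are placeholders, not my exact code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("content_table");   // placeholder table name

        Job job = Job.getInstance(conf, "hfile-bulkload");
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(ContentMapper.class);                    // the mapper shown in the answer below
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(KeyValue.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));       // input sequence files
        FileOutputFormat.setOutputPath(job, new Path(args[1]));     // HFile output directory

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {

            // Sorts and partitions the map output so each reducer writes HFiles for one region.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // Move the generated HFiles into the table's regions.
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(new Path(args[1]), admin, table, locator);
        }
    }
}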

Upvotes: 1

Views: 137

Answers (1)

Shashwat Mishra

Reputation: 96

I discovered the reason behind the dramatic increase in space.

This is what my mapper, which emits the key-value pairs, looked like (the input was a SequenceFile):

public void map(Text key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    ....
    byte[] row = Bytes.toBytes(rowID);
    hKey.set(row);
    // value.getBytes() hands back the writable's entire backing array
    kv = getKV(familyRaw, Bytes.toBytes("content"), value.getBytes());
}

The problem is the call to value.getBytes(). It returns the writable's whole backing byte array, which is padded beyond the valid length (getLength()). Because the SequenceFile reader reuses the same BytesWritable for every record, the backing array grows to fit the largest record seen so far, so every small file was written out with a buffer sized for a much larger one. Changing the call to value.copyBytes(), which copies only the valid bytes, fixed the behaviour.
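Here is a minimal sketch that shows the difference (the class name and sizes are illustrative; it only assumes org.apache.hadoop.io.BytesWritable):

import org.apache.hadoop.io.BytesWritable;

public class BytesWritablePadding {
    public static void main(String[] args) {
        BytesWritable value = new BytesWritable();

        // Simulate the SequenceFile reader reusing one writable:
        // a large record grows the backing array...
        value.set(new byte[70 * 1024 * 1024], 0, 70 * 1024 * 1024);
        // ...then a small record reuses it; the capacity never shrinks.
        value.set(new byte[]{1, 2, 3}, 0, 3);

        System.out.println(value.getLength());        // 3: the valid bytes
        System.out.println(value.getBytes().length);  // on the order of 100 MB: the padded backing array
        System.out.println(value.copyBytes().length); // 3: a copy of just the valid bytes
    }
}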

This is discussed in HADOOP-6298.

Upvotes: 1
