Reputation: 518
I am working with a large set of data stored in HBase. Many of the values stored in my columns are actually "vectors" of data -- multiple values. The way I've set out to handle storing multiple values is through a ByteBuffer. Since I know the type of data stored in every column in my column families, I have written a series of classes extending a base class that wraps around ByteBuffer and gives me an easy set of methods for reading individual values as well as appending additional values to the end. I have tested these classes independently of my HBase project and they work as expected.
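To make this concrete, here is a stripped-down sketch of the kind of wrapper I'm describing (the class and method names are made up for this post, and it assumes fixed-width 8-byte doubles):
import java.nio.ByteBuffer;

// Illustrative sketch only -- a fixed-width vector of doubles stored
// back-to-back in a single HBase cell value.
public class DoubleVectorValue {
    private static final int ENTRY_SIZE = 8; // bytes per double

    private ByteBuffer buffer;

    public DoubleVectorValue(byte[] cellValue) {
        // Wrap the raw bytes pulled out of the Result; any padding here
        // would be misread as part of a value.
        buffer = ByteBuffer.allocate(cellValue.length);
        buffer.put(cellValue);
    }

    public int size() {
        return buffer.position() / ENTRY_SIZE;
    }

    public double get(int index) {
        // Absolute read; does not move the write position.
        return buffer.getDouble(index * ENTRY_SIZE);
    }

    public void append(double value) {
        if (buffer.remaining() < ENTRY_SIZE) {
            // Grow the backing buffer before writing past its capacity.
            ByteBuffer bigger = ByteBuffer.allocate(buffer.capacity() * 2 + ENTRY_SIZE);
            buffer.flip();
            bigger.put(buffer);
            buffer = bigger;
        }
        buffer.putDouble(value);
    }

    public byte[] toBytes() {
        // Copy out only the bytes actually written, ready for a Put.
        ByteBuffer dup = buffer.duplicate();
        dup.flip();
        byte[] out = new byte[dup.remaining()];
        dup.get(out);
        return out;
    }
}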
In order to update my database (nearly every row is updated in each update), I use a TableMapper MapReduce job to iterate over every row in my database. Each of my mappers (there are six in my cluster) loads the entire update file (rarely more than 50MB) into memory and then updates each row id as it iterates over it.
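For context, the job setup looks roughly like this (the table and class names here are placeholders, not my actual ones):
Configuration conf = HBaseConfiguration.create();
Job job = new Job(conf, "vector-update");
job.setJarByClass(UpdateMapper.class);

Scan scan = new Scan();
scan.setCaching(500);        // fetch rows in batches during the full scan
scan.setCacheBlocks(false);  // don't pollute the block cache

// UpdateMapper extends TableMapper<ImmutableBytesWritable, Put>; each map()
// call receives one row's Result and emits a Put with the updated vectors.
TableMapReduceUtil.initTableMapperJob("mytable", scan, UpdateMapper.class,
        ImmutableBytesWritable.class, Put.class, job);
TableMapReduceUtil.initTableReducerJob("mytable", null, job);
job.setNumReduceTasks(0);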
The problem I am encountering is that every time I pull a data value out of the Result object, it has 4 bytes appended to the end of it. This makes things difficult for my update because I am not sure whether to expect this "padding" to be an extra 4 bytes every time or whether it could grow or shrink. Since I am loading this into my ByteBuffer wrapper, it is important that there is no padding, because padding would leave gaps in my data when I append additional data points, which would make it impossible to read them back out later without error.
I've written up a test to confirm my hypothesis by creating a test table and class. The table only has one data point per column (a single double -- I have confirmed that the length of the bytes going in is 8) and I have written the following code to retrieve and examine it.
HTable table = new HTable("test");
byte[] rowId = Bytes.toBytes("myid");
Get get = new Get(rowId);
byte[] columnFamily = Bytes.toBytes("data");
byte[] column = Bytes.toBytes("column");
get.addColumn(columnFamily, column);
Result result = table.get(get);
byte[] value = result.value();
System.out.println("Value size: " + value.length);
double doubleVal = Bytes.toDouble(value);
System.out.println("Fetch yielded: " + doubleVal);
// Copy the value into an 8-byte array, dropping the 4 extra trailing bytes
byte[] test = new byte[8];
for (int i = 0; i < value.length - 4; i++)
    test[i] = value[i];
double dval = Bytes.toDouble(test);
System.out.println("dval: " + dval);
table.close();
Which results in:
Value size: 12
Fetch yielded: 0.3652
dval: 0.3652
The double values come back exactly as stored, but the single 8-byte double comes back as 12 bytes.
Any thoughts on how to tackle this problem? I'm aware of the existence of serialization engines like Avro but I'm trying to avoid using them for the time being and my data is so straightforward that I feel as though I shouldn't have to.
EDIT: I've continued onward, truncating my data down to the largest multiple of my data type's size. In my experience, these extra bytes are exclusively appended to the end of my byte[] array. I've written a few classes that handle this automatically in a rather clean manner, but I'm still curious as to why this might be happening.
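For reference, the truncation amounts to something like this (assuming 8-byte doubles; Bytes.SIZEOF_DOUBLE is the HBase constant for that, and Arrays.copyOf comes from java.util):
// Drop trailing bytes that don't make up a whole double.
int usable = value.length - (value.length % Bytes.SIZEOF_DOUBLE);
byte[] trimmed = Arrays.copyOf(value, usable);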
Upvotes: 3
Views: 436
Reputation: 590
I had a similar problem when importing data using MapReduce into HBase. There were junk bytes appended to my rowkeys, due to this code:
public class MyReducer extends TableReducer<Text, CustomWritable, Text> {

    @Override
    protected void reduce(Text key, Iterable<CustomWritable> values, Context context)
            throws IOException, InterruptedException {
        // only get first value for the example
        CustomWritable value = values.iterator().next();
        Put put = new Put(key.getBytes());
        put.add(columnFamily, columnName, value.getBytes());
        context.write(outputKey, put);
    }
}
The problem is that Text.getBytes() returns the underlying backing array (see the Text javadoc) and the Text object is reused by the MapReduce framework, so the byte array will contain junk bytes from previous values it held. This change fixed it for me:
Put put = new Put(Arrays.copyOf(key.getBytes(), key.getLength()));
If you're using Text as your value type in your job somewhere, it could be doing the same thing.
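For example, the same defensive copy applied to a Text value (textValue is just a stand-in name) would be:
// Text.getBytes() returns the reused backing array, so only the first
// getLength() bytes are valid for this record.
byte[] valueBytes = Arrays.copyOf(textValue.getBytes(), textValue.getLength());
put.add(columnFamily, columnName, valueBytes);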
Upvotes: 2
Reputation: 20192
Is it a JDK 7 vs. JDK 6 issue? Are you running two different JVM versions?
It could be related to something a playorm user ran into: https://github.com/deanhiller/playorm/commit/5e6ede13477a60c2047daaf1f7a7ce55550b0289
Dean
Upvotes: 0