nano7

Reputation: 2493

Java: Hadoop: MapReduce: using filters for retrieving data from hbase, int/string comparison

I want to retrieve data from HBase for my MapReduce job, but I want to filter it first. I only want to retrieve rows that contain a column with an ID greater than or equal to a minId.

I'm storing the ID in HBase as a string. Now I wonder whether the following filter will work.

  int minId = 123;
  Filter filter = new ValueFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL, new BinaryComparator(Bytes.toBytes(minId)));

How can HBase filter my data when the stored ID is a String but the value used for the comparison is an int? Can this work? If I use a String for my BinaryComparator (so String minId = "123";), would it work then?

Thanks for answers!

Upvotes: 0

Views: 1384

Answers (1)

Hari Menon

Reputation: 35435

HBase compares string values lexically (byte by byte). So this would work only if all IDs have the same number of digits. One thing you can do is zero-pad the IDs.

So "123" > "121", but "123" < "21". If you zero-pad them, they become "123" and "021", and then you get the right result.
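To make the zero-padding idea concrete, here is a minimal plain-Java sketch. It does not use HBase at all: `compareBytes` is a hypothetical stand-in for the byte-by-byte ordering a binary comparator applies, and `WIDTH` is an assumed fixed width that you would choose to fit your largest ID.

```java
import java.nio.charset.StandardCharsets;

class ZeroPadDemo {
    // Assumed fixed width; pick one wide enough for the largest ID you expect.
    static final int WIDTH = 10;

    // Left-pad a non-negative ID with zeros to WIDTH digits.
    static String pad(long id) {
        return String.format("%0" + WIDTH + "d", id);
    }

    // Unsigned byte-by-byte comparison (illustrative stand-in for a lexical comparator).
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) return x - y;
        }
        return a.length - b.length;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Unpadded: "123" sorts before "21" because '1' < '2'.
        System.out.println(compareBytes(ascii("123"), ascii("21")) < 0);
        // Padded to the same width: 123 sorts after 21, as expected numerically.
        System.out.println(compareBytes(ascii(pad(123)), ascii(pad(21))) > 0);
    }
}
```

The same padding would have to be applied both when the IDs are written and when the minimum ID is passed to the comparator.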

Another idea is to create a comparator that matches your requirements. Just override the BinaryComparator's compareTo() method. Maybe something like this (I am just editing the compareTo method in PureJavaComparator):

  @Override
  public int compareTo(byte[] buffer1, int offset1, int length1,
      byte[] buffer2, int offset2, int length2) {
    // Remove leading zeros
    int l1 = getNumLeadingZeros(buffer1, offset1, length1);
    int l2 = getNumLeadingZeros(buffer2, offset2, length2);
    offset1=offset1+l1;
    length1=length1-l1;
    offset2=offset2+l2;
    length2=length2-l2;

    // If lengths are different, the value with more digits is the larger number
    int ldiff = length1-length2;
    if(ldiff != 0) return ldiff;

    // If lengths are same, we can use the usual lexical comparator
    return Bytes.compareTo(buffer1, offset1, length1, buffer2, offset2, length2);
  }

  public int getNumLeadingZeros(byte[] arr, int offset, int length) {
      // Count the '0' characters at the start of the value
      int ret = 0;
      byte zero = '0';
      while(ret<length && arr[offset+ret]==zero) {
          ++ret;
      }
      return ret;
  }

It's not super-optimized, and it assumes there are no bad values. You can also skip the leading-zeros step if you are sure there won't be any. I have not tested it, so try it out and let me know if it worked!
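If you want to check the comparison logic before wiring it into HBase, a standalone sketch like the one below may help. It is plain Java on ASCII digit strings; the class and method names are made up for illustration, and like the code above it assumes non-negative values containing only digits.

```java
import java.nio.charset.StandardCharsets;

class NumericAsciiCompare {
    // Count the '0' characters at the start of the given range.
    static int leadingZeros(byte[] arr, int offset, int length) {
        int i = 0;
        while (i < length && arr[offset + i] == '0') {
            i++;
        }
        return i;
    }

    // Compare two ASCII digit sequences by numeric value.
    static int compare(byte[] a, byte[] b) {
        int za = leadingZeros(a, 0, a.length);
        int zb = leadingZeros(b, 0, b.length);
        int la = a.length - za;
        int lb = b.length - zb;
        // More significant digits means a larger number.
        if (la != lb) return la - lb;
        // Same digit count: lexical order equals numeric order.
        for (int i = 0; i < la; i++) {
            int d = (a[za + i] & 0xff) - (b[zb + i] & 0xff);
            if (d != 0) return d;
        }
        return 0;
    }

    static int compare(String x, String y) {
        return compare(x.getBytes(StandardCharsets.US_ASCII),
                       y.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(compare("123", "21") > 0);   // 123 is numerically larger
        System.out.println(compare("007", "7") == 0);   // leading zeros are ignored
        System.out.println(compare("121", "123") < 0);  // same length, lexical order
    }
}
```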

Upvotes: 1
