Roman Nikitchenko

Reputation: 13046

How to index HBase columns with binary data as SOLR fields?

I need to index my data stored in HBase rows. The obvious solution is to use the Lily HBase indexer through replication and push the results into a SOLR collection.

The root of my problem is that I have some 'short binary' columns in my HBase rows, such as MD5, CRC64, UUID and the like. Of course I store them as raw byte[] values, which saves a lot of space. But I need to index data based on some of these columns while keeping their actual representation. What is the correct way to do this?
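
For context, here is roughly what writing such a column looks like on the HBase side (a sketch only, assuming the HBase 1.x+ client API; the table, family and qualifier names are made up):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BinaryColumnWrite {
        public static void main(String[] args) throws Exception {
            // 16 raw bytes, not a printable string.
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest("some payload".getBytes(StandardCharsets.UTF_8));

            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("record"))) {   // hypothetical table
                Put put = new Put(md5);                                        // binary row key
                put.addColumn(Bytes.toBytes("data"),                           // hypothetical family
                              Bytes.toBytes("md5"),                            // hypothetical qualifier
                              md5);                                            // stored as raw byte[]
                table.put(put);
            }
        }
    }

SOLR, on the other hand, expects a text-friendly representation of those bytes before it can index or return them, which is exactly the gap this question is about.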

Upvotes: 2

Views: 1491

Answers (1)

Roman Nikitchenko

Reputation: 13046

I received a working solution:

  • The Lily HBase indexer is configured with the row mapping type. As a result, the document ID (unique key) is the HBase row key.
  • The HBase row key, which contains binary data, is formatted according to the Lily HBase indexer configuration, where the unique key formatter is set to 'com.ngdata.hbaseindexer.uniquekey.HexUniqueKeyFormatter'. As a result, the document ID ('id') SOLR field is a string of lowercase hex digits matching the binary representation of the row key. This can probably be done better, but it at least works as expected. Note that the 'id' SOLR field is of type string here.
  • Binary cells are transformed by a Morphline based on the extractHBaseCells command from Cloudera Search. A mapping with type byte[] is used, which turns out to produce exactly Base64-encoded fields (a sketch of the resulting SOLR document follows this list).
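
To make the effect of this mapping concrete, here is a small SolrJ-style sketch of the document that ends up in the collection under this configuration. This is not the indexer's own code, and the 'md5' field name and sample values are made up for illustration:

    import java.util.Base64;

    import org.apache.commons.codec.binary.Hex;
    import org.apache.solr.common.SolrInputDocument;

    public class MappedDocumentSketch {
        public static void main(String[] args) {
            // A sample binary row key and a sample binary cell value (e.g. an MD5 digest).
            byte[] rowKey  = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};
            byte[] md5Cell = new byte[16];

            SolrInputDocument doc = new SolrInputDocument();
            // Row mapping + HexUniqueKeyFormatter: the 'id' field becomes the lowercase
            // hex string of the row key ("deadbeef"), held in a string-typed field.
            doc.addField("id", Hex.encodeHexString(rowKey));
            // A cell mapped through extractHBaseCells with type byte[] arrives Base64-encoded.
            doc.addField("md5", Base64.getEncoder().encodeToString(md5Cell));
            System.out.println(doc);
        }
    }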

UPDATE 1:

  • Added an HBASE_INDEXER_CLASSPATH environment configuration for the HBase indexer and an additional class extending com.ngdata.hbaseindexer.uniquekey.BaseUniqueKeyFormatter, which now performs Base64 encoding of the unique key so it can be declared as a BinaryField. This finally did ALL the things I demand from the indexer: SOLR now receives correct 'update' requests with a Base64-encoded 'id' field plus fields mapped from the other needed columns. A sketch of such a formatter follows below.
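
A minimal sketch of such a formatter, under the assumption that BaseUniqueKeyFormatter exposes encodeAsString / decodeFromString hooks the way its hex and string counterparts do; the exact method names and signatures should be checked against your hbase-indexer version:

    import java.util.Base64;

    import com.ngdata.hbaseindexer.uniquekey.BaseUniqueKeyFormatter;

    // Renders the binary HBase row key as Base64 so the SOLR 'id' field
    // can be declared as a BinaryField.
    public class Base64UniqueKeyFormatter extends BaseUniqueKeyFormatter {

        @Override
        protected String encodeAsString(byte[] bytes) {
            // Base64-encode the raw row key bytes for the SOLR document ID.
            return Base64.getEncoder().encodeToString(bytes);
        }

        @Override
        protected byte[] decodeFromString(String value) {
            // Reverse mapping back to the original row key bytes.
            return Base64.getDecoder().decode(value);
        }
    }

The class goes into a jar that is placed on HBASE_INDEXER_CLASSPATH, and its fully qualified name replaces HexUniqueKeyFormatter as the unique key formatter in the indexer configuration.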

UPDATE 2:

  • After playing enough with solr.BinaryField, I settled on plain solr.StrField for everything that I need to index AS IS. Binary byte strings such as hashes are transformed into a sequence of lowercase hex digits, two digits per byte. Maybe not the best in terms of performance, but it looks the most portable and flexible. For 'just stored' fields I already have the Base64 encoder, but I don't need fields in SOLR if I don't index them. A sketch of both encodings follows below.
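
For reference, a plain-JDK sketch of the two encodings described above; the class and method names here are mine, not part of any of the involved libraries:

    import java.util.Base64;

    public class FieldEncoders {

        // Indexed-as-is fields: lowercase hex, two digits per byte, kept in a solr.StrField.
        public static String toLowercaseHex(byte[] bytes) {
            StringBuilder sb = new StringBuilder(bytes.length * 2);
            for (byte b : bytes) {
                sb.append(Character.forDigit((b >> 4) & 0xF, 16));
                sb.append(Character.forDigit(b & 0xF, 16));
            }
            return sb.toString();
        }

        // 'Just stored' fields: plain Base64.
        public static String toBase64(byte[] bytes) {
            return Base64.getEncoder().encodeToString(bytes);
        }

        public static void main(String[] args) {
            byte[] crc = {0x01, (byte) 0xAB, (byte) 0xFF};
            System.out.println(toLowercaseHex(crc)); // 01abff
            System.out.println(toBase64(crc));       // Aav/
        }
    }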

Upvotes: 3
