Add Variable Number of Columns for Column Family in HBase Using Pig

Question

I need to load data from Pig into HBase using HBaseStorage, and I can't figure out how to store a variable number of columns in a specific column family. (With a known number of columns it is straightforward.)

My data looks like this (spaces added for readability):

Id,ItemId,Count,Date
1 ,1     ,2    ,2015-02-01
2 ,2     ,2    ,2015-02-02
3 ,1     ,2    ,2015-02-03

I have an HBase table with a rowkey and one column family called Attributes. I first load the CSV using:

A = LOAD 'items.csv' USING PigStorage(',')
as (Id, ItemId, Count:chararray, Date:chararray);

And now I want to group them by ItemId so I do the following:

B = FOREACH A GENERATE ItemId, TOTUPLE(Date, Count);

C = GROUP B BY ItemId;

So I get my data nicely grouped, with the key and then the tuples with Date and Count:

1   {(2015-02-03, 2),(2015-02-01, 2)}
2   {(2015-02-02, 2)}

What I am aiming for in HBase is one row per ItemId, with one column per date holding the count:

Rowkey = 1 (Attributes.2015-02-03,2) (Attributes.2015-02-01,2)
Rowkey = 2 (Attributes.2015-02-02,2)

This is the part I am struggling with: how do I define that I have a variable number of columns? I have tried the following, as well as multiple other combinations:

STORE onlygroups into 'hbase://mytable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');

But I get several errors, for example this one:

ERROR 2999: Unexpected internal error. org.apache.pig.data.InternalCachedBag 
    cannot be cast to java.util.Map
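
The cast error suggests that the 'Attributes:*' wildcard expects a map field (qualifier -> value) rather than a bag of tuples. As a rough illustration of that shape change only, here is a minimal Python sketch, not Pig Latin and not a tested HBaseStorage solution. The function name bag_to_map and the sample data are hypothetical; the sketch assumes each bag holds (date, count) pairs as in the grouped output above.

```python
def bag_to_map(bag):
    """Turn a bag of (date, count) tuples into a {date: count} map.

    Hypothetical helper: each date becomes a column qualifier and
    each count becomes that column's value.
    """
    result = {}
    for date, count in bag:
        result[date] = count
    return result


# Sample data mirroring the grouped output: (ItemId, bag of (date, count)).
grouped = [
    ("1", [("2015-02-03", "2"), ("2015-02-01", "2")]),
    ("2", [("2015-02-02", "2")]),
]

# One (rowkey, map) pair per item, which is the shape a wildcard
# column family would need.
rows = [(item_id, bag_to_map(bag)) for item_id, bag in grouped]

for row in rows:
    print(row)
```

Whether such a function can be dropped into a Pig script as a UDF depends on the Pig version and setup, so treat this purely as a picture of the target structure: one map per rowkey, with a variable number of entries.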

I have also tried using TOMAP, but that does not work either. Any suggestions?

Note: the recommended solution identified as a duplicate does not solve my issue. It basically recommends using MapReduce, and my data structure is different.
