xmorera

Reputation: 1961

Add Variable Number of Columns for Column Family in HBase Using Pig

I need to load data from Pig into HBase using HBaseStorage, and I can't figure out how to store a variable number of columns in a specific column family. (With a known number of columns it is straightforward.)

My data looks like this (spaces added for readability):

Id,ItemId,Count,Date
1 ,1     ,2    ,2015-02-01
2 ,2     ,2    ,2015-02-02
3 ,1     ,2    ,2015-02-03

I have an HBase table with a row key and one column family called Attributes. I first load the CSV using:

A = LOAD 'items.csv' USING PigStorage(',') 
as (Id,ItemId,Count:chararray, CreationDate:chararray);

And now I want to group them by ItemId so I do the following:

B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count);

C = GROUP B BY ItemId;

So I get my data nicely grouped, with the key and then the tuples with Date and Count:

1   {(2015-02-03, 2),(2015-02-01, 2)}
2   {(2015-02-02, 2)}

What I am aiming for in HBase is one row per ItemId, with one column per date (the date as the column qualifier and the count as the value):

Rowkey = 1 (Attributes.2015-02-03,2) (Attributes.2015-02-01,2)
Rowkey = 2 (Attributes.2015-02-02,2)

This is the part I am struggling with: how do I declare a variable number of columns? I have tried the following, as well as several other combinations:

STORE C into 'hbase://mytable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');

But I get several errors, for example:

ERROR 2999: Unexpected internal error. org.apache.pig.data.InternalCachedBag 
    cannot be cast to java.util.Map

I have also tried using TOMAP, but that does not work either. Any suggestions?

Note: the question flagged as a duplicate does not solve my issue; its answer basically recommends using MapReduce, and my data structure is different.

Upvotes: 1

Views: 1017

Answers (1)

maxteneff

Reputation: 1531

To store data in HBase, each tuple in your Pig relation should have the following format:

tuple(key, map(col_qual, col_value))

In your case:

(1,[2015-02-03#2])
(1,[2015-02-01#2])
(2,[2015-02-02#2])

You can create this type of object right from your initial data:

A = LOAD 'items.csv' USING PigStorage(',') as (Id,ItemId,Count:chararray,CreationDate:chararray);
storeHbase = FOREACH A GENERATE ItemId, TOMAP(CreationDate, Count);

Or, if you want to build it after grouping by key:

B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count) as pair;
C = GROUP B BY ItemId;
storeHbase = FOREACH C {
    Tmp = FOREACH $1 GENERATE TOMAP(pair.CreationDate,pair.Count);
    GENERATE group, FLATTEN(Tmp);
};

Finally, you can store your data in HBase:

STORE storeHbase into 'hbase://mytable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');

where mytable is your HBase table and Attributes is your column family.
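
Putting the pieces together, a minimal end-to-end sketch of the non-grouped variant could look like the following. It assumes that items.csv has no header row and no padding spaces, and that every field is loaded as chararray (the Id type is my assumption; the original leaves it untyped). No grouping is needed here, because HBase accumulates columns written to the same row key, so several (key, map) tuples with the same ItemId end up in one row.

```pig
-- Assumption: items.csv has no header row and no padding spaces.
A = LOAD 'items.csv' USING PigStorage(',')
    AS (Id:chararray, ItemId:chararray, Count:chararray, CreationDate:chararray);

-- One (rowkey, map) tuple per input line.
-- The first field becomes the HBase row key; each map key becomes a
-- column qualifier in the Attributes family and each map value the cell value.
storeHbase = FOREACH A GENERATE ItemId, TOMAP(CreationDate, Count);

-- Optional: inspect the schema before writing. The second field should be a map,
-- not a bag; a bag here is what triggers the "cannot be cast to java.util.Map" error.
DESCRIBE storeHbase;

-- Tuples that share a row key are written to the same HBase row.
STORE storeHbase INTO 'hbase://mytable'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');
```

Afterwards, running scan 'mytable' in the HBase shell should show one row per ItemId, with one Attributes column per date.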

Upvotes: 1
