Reputation: 1961
I need to load data from Pig into HBase using HBaseStorage, and I can't figure out how to store a variable number of columns in a specific column family. (With a known number of columns it is straightforward.)
I have data that looks like this (spaces added for readability):
Id,ItemId,Count,Date
1 ,1 ,2 ,2015-02-01
2 ,2 ,2 ,2015-02-02
3 ,1 ,2 ,2015-02-03
I have an HBase table with a rowkey and one column family called Attributes. I first load the CSV using:
A = LOAD 'items.csv' USING PigStorage(',')
as (Id,ItemId,Count:chararray, CreationDate:chararray);
Now I want to group the rows by ItemId, so I do the following:
B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count);
C = GROUP B BY ItemId;
So I get my data nicely grouped, with the key and then the tuples with Date and Count:
1 {(2015-02-03, 2),(2015-02-01, 2)}
2 {(2015-02-02, 2)}
What I am aiming for in HBase is one row per ItemId, with one column per date holding the count:
Rowkey = 1 (Attributes.2015-02-03,2) (Attributes.2015-02-01,2)
Rowkey = 2 (Attributes.2015-02-02,2)
This is the part I am struggling with: how do I define that I have a variable number of columns? I have tried the following, as well as multiple other combinations:
STORE C into 'hbase://mytable'
USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');
But I get several errors, for example this one:
ERROR 2999: Unexpected internal error. org.apache.pig.data.InternalCachedBag
cannot be cast to java.util.Map
I have also tried using TOMAP, but that does not work either. Any suggestions?
Note: the recommended solution identified as a duplicate does not solve my issue. It basically recommends using MapReduce, and my data structure is different.
Upvotes: 1
Views: 1017
Reputation: 1531
To load data into HBase, your data in Pig should have the following format:
tuple(key, map(col_qual, col_value))
In your case:
(1,[2015-02-03#2])
(1,[2015-02-01#2])
(2,[2015-02-02#2])
Tuples that share a key are written to the same HBase row, so you can create this structure directly from your initial data, without grouping:
A = LOAD 'items.csv' USING PigStorage(',') as (Id,ItemId,Count:chararray,CreationDate:chararray);
storeHbase = FOREACH A GENERATE ItemId, TOMAP(CreationDate, Count);
Or, if you want to build it after grouping by key:
B = FOREACH A GENERATE ItemId, TOTUPLE(CreationDate, Count) as pair;
C = GROUP B BY ItemId;
storeHbase = FOREACH C {
Tmp = FOREACH $1 GENERATE TOMAP(pair.CreationDate,pair.Count);
GENERATE group, FLATTEN(Tmp);
};
Finally, you can store your data in HBase:
STORE storeHbase into 'hbase://mytable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('Attributes:*');
where mytable is your HBase table and Attributes is your column family.
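Before running the STORE, it may help to check that the relation really has the (key, map) shape described above. The following is only a sketch using the relation names from this answer. It assumes that items.csv starts with the header line shown in the question and that all fields can be read as chararray. The header filter is an addition that is not part of the original script.

```pig
-- Sketch only: file name, field names and the header filter are assumptions.
A = LOAD 'items.csv' USING PigStorage(',')
    AS (Id:chararray, ItemId:chararray, Count:chararray, CreationDate:chararray);

-- Drop the header line (assumed present), otherwise it would be stored as a row.
NoHeader = FILTER A BY ItemId != 'ItemId';

-- First field: row key. Second field: map of column qualifier -> value.
storeHbase = FOREACH NoHeader GENERATE ItemId, TOMAP(CreationDate, Count);

-- Inspect the schema and a sample of the data before storing.
DESCRIBE storeHbase;
DUMP storeHbase;
```

If DUMP shows a bag or a tuple in the second position instead of a map, HBaseStorage with 'Attributes:*' will likely fail with a cast error like the one in the question.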
Upvotes: 1