sudheer

Reputation: 337

Is there an alternative to HBaseStorage in Pig?

I am using HBaseStorage with the -caching option in a Pig script as follows:

HBaseStorage('countDetails:ansCount countDetails:divCount countDetails:unansCount countDetails:engCount countDetails:ineffCount countDetails:totalCount', '-caching 1000');

I can see the setting reflected in my job.xml, but it makes no difference to the running time. I am processing 10 million records and storing about 160 MB of data into HBase. When I store the result in HDFS the job takes 3 minutes; the same job takes 30 minutes when storing into HBase.

I even tried setting:

SET hbase.client.scanner.caching 1000;

Please let me know how I can reduce the time. Is there any alternative to HBaseStorage? http://apmblog.compuware.com/2013/02/19/speeding-up-a-pighbase-mapreduce-job-by-a-factor-of-15/

The blog post above says that I have to set hbase.client.scanner.caching in a bootstrap script, but I don't know how to do that. Will it be enough if I set it in the HBase conf? Please help me out.

Upvotes: 1

Views: 418

Answers (2)

Ashish

Reputation: 5791

In my experience, HBase doesn't perform very well with Pig. If you don't have a requirement for random look-ups, use only HDFS; otherwise an HBase MR job would be a better option. Also, in a Hadoop MR job you can connect to HBase (this option gave me the best performance).

Upvotes: 1

Arun A K

Reputation: 2225

hbase.client.scanner.caching is the number of rows that will be fetched when calling next on a scanner if the call is not served from (local, client) memory.

Higher caching values make scanners faster but use more memory, and some calls to next may take longer when the cache is empty. Do not set this value so high that the time between invocations exceeds the scanner timeout, i.e. hbase.regionserver.lease.period. This property is 1 minute by default; clients must report in within this period or they are considered dead.
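To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It is not HBase code: the function names and the per-row processing time are hypothetical, and the model is simplified (it ignores region boundaries and network latency). It only illustrates how the caching value changes the number of next round trips for a 10-million-row scan, and how to sanity-check a value against the 1 minute lease period. Note that this setting concerns scans (reads), so it may not help much if the slow part of the job is the write into HBase.

```python
def scanner_round_trips(total_rows: int, caching: int) -> int:
    """Approximate number of next() round trips needed to scan total_rows.

    Simplified model (an assumption): each round trip returns up to
    `caching` rows, and region boundaries are ignored.
    """
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Integer ceiling division: the last batch may be partially filled.
    return (total_rows + caching - 1) // caching


def batch_fits_lease(caching: int, seconds_per_row: float,
                     lease_seconds: float = 60.0) -> bool:
    """True if the client can process one cached batch before the lease expires.

    seconds_per_row is a hypothetical estimate of client-side work per row;
    lease_seconds defaults to the 1 minute mentioned above.
    """
    return caching * seconds_per_row < lease_seconds


if __name__ == "__main__":
    rows = 10_000_000  # record count from the question
    for caching in (1, 100, 1000):
        print(caching, scanner_round_trips(rows, caching))
    # With 10 ms of work per row, a batch of 1000 rows takes about 10 s.
    print(batch_fits_lease(1000, 0.01))
```

Under this model a caching value of 1000 cuts the round trips from 10,000,000 to 10,000, and stays within the lease as long as the client spends well under 60 ms on each row.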

Upvotes: 1
