Reputation: 314
I'm trying to write a Pig Latin script that returns the row count of a dataset I've filtered.
Here's the script so far:
/* scans by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
scancount = FOREACH productscans GENERATE COUNT($0);
DUMP scancount;
For some reason, I get the error:
Could not infer the matching function for org.apache.pig.builtin.COUNT as multiple or none of them fit. Please use an explicit cast.
What am I doing wrong here? I'm assuming it has something to do with the type of the field I'm passing in, but I can't seem to resolve this.
TIA, Jason
Upvotes: 9
Views: 13956
Reputation: 1815
COUNT requires a preceding GROUP ALL statement for global counts and a GROUP BY statement for group counts.
You can use either of the following:
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
DUMP count;
Or
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT($1);
DUMP count;
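For the per-group case mentioned at the top of this answer, here is a minimal sketch, assuming you want one count per category rather than a single global count. The aliases bycategory and categorycounts and the output field name scans_in_category are illustrative and not part of the original script.

```pig
-- Same load and filter as in the question.
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');

-- GROUP BY produces one tuple per category: the key in 'group'
-- and a bag named productscans holding that category's rows.
bycategory = GROUP productscans BY category;

-- COUNT takes the bag, so it yields one count per category.
categorycounts = FOREACH bycategory GENERATE group AS category, COUNT(productscans) AS scans_in_category;
DUMP categorycounts;
```

Note that COUNT skips tuples whose first field is null. If those rows should be counted too, COUNT_STAR is the built-in to look at.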
Upvotes: 7
Reputation: 30089
Is this what you're looking for? Group by ALL to bring everything into one bag, then count the items:
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;
count = FOREACH grouped GENERATE COUNT(productscans);
dump count;
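If it helps to see why COUNT($0) on the filtered relation fails, one way to check (a sketch, not something the original script did) is to print the schemas with DESCRIBE. In productscans, $0 is thetime, a long. In grouped, the second field is a bag, which is what COUNT expects.

```pig
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
grouped = GROUP productscans ALL;

-- Flat schema: every field is a scalar, so there is no bag to count here.
DESCRIBE productscans;

-- Grouped schema: a 'group' key plus a bag named productscans.
DESCRIBE grouped;
```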
Upvotes: 16
Reputation: 622
Maybe
/* scans by title */
scans = LOAD '/hive/scans/*' USING PigStorage(',') AS (thetime:long,product_id:long,lat:double,lon:double,user:chararray,category:chararray,title:chararray);
productscans = FILTER scans BY (title MATCHES 'proactiv');
scancount = FOREACH productscans GENERATE COUNT(productscans);
DUMP scancount;
Upvotes: 0