Tim

Reputation: 2028

Group by expression in pig

Suppose I have a dataset with tuples (f1, f2). I want to split my data into two bags: one where f1 is null and the other where f1 is not null. I tried:

raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);
raw_group = GROUP raw BY f1 is null;
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);

I expect to get two groups with keys true and false. When I run it in grunt I get the following:

2013-12-26 14:56:10,958 [main] ERROR org.apache.pig.tools.grunt.Grunt - 
ERROR 1200: <line 1046, column 25>  Syntax error, unexpected symbol at or near 'f1'

I can do a workaround:

raw_group = GROUP raw BY (f1 is null)?0:1;

However, I would really like to understand what's going on here, as I have just started learning Pig. According to the Pig documentation, I can use expressions as a grouping key. Am I missing something here, or are nulls treated differently in Pig?

Upvotes: 0

Views: 261

Answers (1)

reo katoa

Reputation: 5801

The boolean datatype was introduced in Pig 0.10. The expression f1 is null evaluates to a boolean, so in earlier versions it can't appear as a field in a relation, which is what would happen if it were the value of group. Prior to Pig 0.10, boolean expressions could only be used in FILTER statements or as the condition of the ternary operator, as you showed in your workaround.

While I haven't tried this out, presumably if you were to attempt the same thing in Pig 0.10 or later, your original attempt would succeed.
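
For reference, here is a minimal sketch of the two places where a boolean expression is accepted before Pig 0.10. It reuses the relation and field names from the question; the aliases f1_null and f1_not_null are made up for illustration, and I haven't run this against a specific Pig version.

```pig
raw = LOAD 'somedata' USING PigStorage() AS (f1:chararray, f2:chararray);

-- Boolean used as the condition of the ternary operator:
-- the grouping key is the int 0 or 1, not a boolean.
raw_group = GROUP raw BY ((f1 is null) ? 0 : 1);
raw_count = FOREACH raw_group GENERATE group, COUNT_STAR(raw);

-- Boolean used directly as a FILTER condition:
-- this yields two separate relations instead of two groups.
f1_null     = FILTER raw BY f1 is null;
f1_not_null = FILTER raw BY f1 is not null;
```

The first form keeps everything in one relation with a key of 0 (f1 is null) or 1 (f1 is not null); the second is more convenient if the two subsets are processed differently afterwards.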

Upvotes: 1
