Reputation: 2442
I have simple data loaded in a:
dump a
ahoeh,1,e32
hello,2,10
ho,3
I need to filter out all rows with number of columns/fields different than 3. How to do it? In other words result should be:
dump results
ahoeh,1,e32
hello,2,10
I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.
Thanks!
Upvotes: 2
Views: 3158
Reputation: 835
(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)
The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR
instead of COUNT
to count empty/null fields in your schema.
See: https://pig.apache.org/docs/r0.9.1/func.html#count-star
For example, given the following input data:
1,2,3
1,,3
and this Pig script:
a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 != 3;
result = foreach filtered generate $1..;
The filtered
alias will contain both rows. The difference is that COUNT({(1),(),(3)})
returns 2
while COUNT_STAR({(1),(),(3)})
returns 3
.
Upvotes: 1
Reputation: 4724
Can you try this?
input
ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6
PigScript1:
A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;
PigScript2:
A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;
Output:
(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)
Upvotes: 2
Reputation: 3078
I see two ways to do this: First, you can rephrase the filter I think, as it boils down to: Give me all lines that do not contain an NULL value. For lots of columns, writing this filter statement is rather tedious.
Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.
Either way, some tediousness is required I think.
Upvotes: 0