xhudik
xhudik

Reputation: 2442

PIG filter out rows with improper number of columns

I have simple data loaded in a:

dump a

ahoeh,1,e32
hello,2,10
ho,3

I need to filter out all rows with number of columns/fields different than 3. How to do it? In other words result should be:

dump results

ahoeh,1,e32
hello,2,10

I know there should be a FILTER built-in function. However I cannot figure out what condition (number of columns =3) should be defined.

Thanks!

Upvotes: 2

Views: 3158

Answers (3)

Jacob
Jacob

Reputation: 835

(It seems that I don't have enough karma to comment; that's why this is posted as a new answer.)

The accepted answer doesn't quite behave as expected if null/empty string is a valid field value; you need to use COUNT_STAR instead of COUNT to count empty/null fields in your schema.

See: https://pig.apache.org/docs/r0.9.1/func.html#count-star

For example, given the following input data:

1,2,3
1,,3

and this Pig script:

a = load 'input' USING PigStorage(',');
counted = foreach a generate COUNT_STAR(TOBAG(*)), $0..;
filtered = filter counted by $0 != 3;
result = foreach filtered generate $1..;

The filtered alias will contain both rows. The difference is that COUNT({(1),(),(3)}) returns 2 while COUNT_STAR({(1),(),(3)}) returns 3.

Upvotes: 1

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

Can you try this?

input

ahoeh,1,e32
hello,2,10
ho,3
3,te,0
aa,3,b
y,h,3
3,3,3
3,3,3,1,2,3,3,,,,,,4,44,6

PigScript1:

A = LOAD 'input' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,','));
C = FOREACH B GENERATE COUNT(TOBAG(*)),$0..;
D = FILTER C BY $0==3;
E = FOREACH D GENERATE $1..;
DUMP E;

PigScript2:

A = LOAD 'input' USING PigStorage(',');
B = FOREACH A GENERATE COUNT(TOBAG(*)),$0..;
C = FILTER B BY (int)$0==3;
D = FOREACH C GENERATE $1..;
DUMP D;

Output:

(ahoeh,1,e32)
(hello,2,10)
(3,te,0)
(aa,3,b)
(y,h,3)
(3,3,3)

Upvotes: 2

LiMuBei
LiMuBei

Reputation: 3078

I see two ways to do this: First, you can rephrase the filter I think, as it boils down to: Give me all lines that do not contain an NULL value. For lots of columns, writing this filter statement is rather tedious.

Second, you could convert your columns into a bag per line, using TOBAG (http://pig.apache.org/docs/r0.12.1/func.html#tobag) and then write a UDF that processes the input bag to check for null tuples in this bag and return true or false and use this in the filter statement.

Either way, some tediousness is required I think.

Upvotes: 0

Related Questions