Tarun
Tarun

Reputation: 11

Filter Out NULL value via Pig

I have table which i read via following.

 A = load 'customer' using PigStorage('|');

In customer following are some of the row

7|Ron|[email protected]
8|Rina  
9|Don|[email protected]
9|Don|[email protected]
10|Maya|[email protected]

11|marry|[email protected]

When i use following....

B = DISTINCT A;
A_CLEAN = FILTER B by ($0 is not null) AND ($1 is not null) AND ($2 is not null);

it removes 8|Rina as well

How to remove null rows via Pig?

Is there a way i can try A_CLEAN = filter B by not IsNULL()???

I am new to pig so not sure what i shuld put inside IsNULL...

Thanks

A_CLEAN = filter B by not IsEmpty(B);

Upvotes: 1

Views: 7274

Answers (2)

Sivasakthi Jayaraman
Sivasakthi Jayaraman

Reputation: 4724

 Tarun, instead AND condition why can't you put OR condition.
        A_CLEAN = FILTER B by ($0 is not null) OR ($1 is not null) OR ($2 is not null);
 This will remove all the null rows and retain if any columns is not empty. 
 Can you try and let me know if this works for your all conditions?

UPDATE:
I don't know why IsEmpty() is not working for you, its working for me. IsEmpty will work only with bag, so i converted all the fields to bag and test the emptiness. see the below working code.

input.txt
7|Ron|[email protected]
8|Rina
9|Don|[email protected]
9|Don|[email protected]
10|Maya|[email protected]

11|marry|[email protected]

PigSCript:
A = LOAD 'input.txt' USING PigStorage('|');
B = DISTINCT A;
A_CLEAN = FILTER B BY NOT IsEmpty(TOBAG($0..));
DUMP A_CLEAN;

Output:
(8,Rina  )
(7,Ron,[email protected])
(9,Don,[email protected])
(10,Maya,[email protected])
(11,marry,[email protected])

For your another question, its a simple math calculation

In case of AND, 
8|Rina
 will be treated as
 ($0 is not null) AND ($1 is not null) AND ($2 is not null)
 (true) AND (true) AND (false)
 (false) -->so this record will be skipped by Filter command

In case of OR, 
8|Rina
 will be treated as
 ($0 is not null) OR ($1 is not null) OR ($2 is not null)
 (true) OR (true) OR (false)
 (true) -->so this record will be included into the relation by Filter command

In case of empty record, 
<empty record>
  will be treated as
  ($0 is not null) OR ($1 is not null) OR ($2 is not null)
  (false) OR (false) OR (false)
  (false) -->so this record will be skipped by Filter command

Upvotes: 0

Anish Thomas
Anish Thomas

Reputation: 106

Try the following:

A = LOAD 'customer' USING PigStorage('|');
B = DISTINCT A;
A_CLEAN = FILTER B BY NOT(($0 IS NULL) AND ($1 IS NULL) AND ($2 IS NULL));
DUMP A_CLEAN;

This will produce the output:

(8,Rina )
(7,Ron,[email protected])
(9,Don,[email protected])
(10,Maya,[email protected])
(11,marry,[email protected])

In PIG, you cannot test a tuple for emptiness.

Upvotes: 2

Related Questions