Woody Wang

Reputation: 31

Using FILTER after FOREACH in Pig-Latin failed

I am a beginner in Pig Latin and I have run into a problem with the FILTER statement. Consider this example:

Suppose we have a data file (test.txt) with the following content:

1,2,3
2,3,4
3,4,5
4,5,6

I want to select the records whose first field is '3'. The Pig script is:

t = LOAD 'test.txt' USING PigStorage(',');
t1 = FOREACH t GENERATE $0 AS i0:chararray, $1 AS i1:chararray, $2 AS i2:chararray;
f1 = FILTER t1 BY i0 == '3';
DUMP f1;

The job runs without errors, but the output is empty. EXPLAIN f1 shows:

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-27
Map Plan
f1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-26
|
|---f1: Filter[bag] - scope-22
    |   |
    |   Equal To[boolean] - scope-25
    |   |
    |   |---Project[chararray][0] - scope-23
    |   |
    |   |---Constant(3) - scope-24
    |
    |---t1: New For Each(false,false,false)[bag] - scope-21
        |   |
        |   Project[bytearray][0] - scope-15
        |   |
        |   Project[bytearray][1] - scope-17
        |   |
        |   Project[bytearray][2] - scope-19
        |
        |---t: Load(file:///Users/woody/test.txt:PigStorage(',')) - scope-14--------
Global sort: false
----------------

However, if I replace the first two lines with:

t1 = LOAD 'test.txt' USING PigStorage(',') AS (i0:chararray, i1:chararray, i2:chararray);

(i.e., declare the schema in the LOAD statement)

the job runs and the result is correct. In this case, EXPLAIN f1 shows:

#--------------------------------------------------
# Map Reduce Plan                                  
#--------------------------------------------------
MapReduce node scope-33
Map Plan
f1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-32
|
|---f1: Filter[bag] - scope-28
    |   |
    |   Equal To[boolean] - scope-31
    |   |
    |   |---Project[chararray][0] - scope-29
    |   |
    |   |---Constant(3) - scope-30
    |
    |---t1: New For Each(false,false,false)[bag] - scope-27
        |   |
        |   Cast[chararray] - scope-19
        |   |
        |   |---Project[bytearray][0] - scope-18
        |   |
        |   Cast[chararray] - scope-22
        |   |
        |   |---Project[bytearray][1] - scope-21
        |   |
        |   Cast[chararray] - scope-25
        |   |
        |   |---Project[bytearray][2] - scope-24
        |
        |---t1: Load(file:///Users/woody/test.txt:PigStorage(',')) - scope-17--------
Global sort: false
----------------

Is this a bug in Pig, or is there a good way to avoid it?

pig --version on my computer is:

Apache Pig version 0.9.2 (r1232772)
compiled Jan 18 2012, 07:57:19

Upvotes: 2

Views: 4907

Answers (3)

This should work:

t = LOAD 'test.txt' USING PigStorage(',') AS (i0:int, i1:int, i2:int);

t = FILTER t BY i0 == 3;

DUMP t;

Upvotes: 0

dave campbell

Reputation: 601

Interestingly, this appears to be a known issue that was closed as 'Won't Fix', so it is not officially treated as a bug. It is strange behavior, though, and it seems to explain some of the wonkiness I've experienced in the past with the FILTER operator.

The following issue is similar and was closed as 'Won't Fix' in the comment stream: https://issues.apache.org/jira/browse/PIG-1341

There do appear to be subtleties in how casts are applied during loads, and this may help others: http://ofps.oreilly.com/titles/9781449302641/data_model.html#type_strength

The previous answer is spot-on. I've confirmed that it returns the expected result.

This may prompt me to cast explicitly more often in the future. Great question and answer. Sorry for not adding anything new, but I thought there was enough around the topic to post.
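
One way to check for this mismatch is to compare the schema Pig reports with what the plan actually does. A minimal sketch, assuming the same test.txt and the FOREACH from the question (only tried against the behavior described there, Pig 0.9.2):

```pig
-- Load without a schema: every field arrives as bytearray.
t = LOAD 'test.txt' USING PigStorage(',');

-- Types given only in the AS clause, exactly as in the question.
t1 = FOREACH t GENERATE $0 AS i0:chararray, $1 AS i1:chararray, $2 AS i2:chararray;

-- Prints the schema that Pig believes t1 has.
DESCRIBE t1;

-- Prints the plans for t1; check whether a Cast[chararray] sits above each Project[bytearray].
EXPLAIN t1;
```

If the behavior matches the question, DESCRIBE should report chararray fields while the plan for t1 contains only Project[bytearray] entries and no Cast[chararray].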

Upvotes: 2

Romain

Reputation: 7082

As far as I know, putting a type in the AS clause of a GENERATE only declares a type for the field; it does not actually cast the data:

GENERATE $0 AS i0:chararray

You need to cast it manually:

t1 = FOREACH t GENERATE (chararray) $0 AS i0, (chararray) $1 AS i1, (chararray) $2 AS i2;

It is counterintuitive and is probably a bug.
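
Putting the explicit cast together with the rest of the script from the question gives something like the following. This is a sketch that assumes the same test.txt and Pig 0.9.2; I have not checked other versions.

```pig
-- Load without a schema: every field arrives as bytearray.
t = LOAD 'test.txt' USING PigStorage(',');

-- The explicit (chararray) casts do the conversion; AS only names the fields here.
t1 = FOREACH t GENERATE (chararray) $0 AS i0, (chararray) $1 AS i1, (chararray) $2 AS i2;

-- Both sides of the comparison are now chararray values.
f1 = FILTER t1 BY i0 == '3';

DUMP f1;
```

Keeping everything as chararray mirrors the question; if the fields are really numbers, casting to int and comparing with 3 (as in the first answer) may be the more natural choice.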

Upvotes: 1
