Reputation: 31
I am a beginner in Pig Latin and have run into a problem with the FILTER statement. Consider this example:
Suppose we have a data file (test.txt) with the following content:
1,2,3
2,3,4
3,4,5
4,5,6
I want to select the records whose first field is '3'. The Pig script is:
t = LOAD 'test.txt' USING PigStorage(',');
t1 = FOREACH t GENERATE $0 AS i0:chararray, $1 AS i1:chararray, $2 AS i2:chararray;
f1 = FILTER t1 BY i0 == '3';
DUMP f1;
The job runs without errors, but the output is empty. EXPLAIN f1 shows:
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-27
Map Plan
f1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-26
|
|---f1: Filter[bag] - scope-22
| |
| Equal To[boolean] - scope-25
| |
| |---Project[chararray][0] - scope-23
| |
| |---Constant(3) - scope-24
|
|---t1: New For Each(false,false,false)[bag] - scope-21
| |
| Project[bytearray][0] - scope-15
| |
| Project[bytearray][1] - scope-17
| |
| Project[bytearray][2] - scope-19
|
|---t: Load(file:///Users/woody/test.txt:PigStorage(',')) - scope-14--------
Global sort: false
----------------
However, if I replace the first two lines with:
t1 = LOAD 'test.txt' USING PigStorage(',') AS (i0:chararray, i1:chararray, i2:chararray);
(i.e. declare the schema in the LOAD statement)
the job works and the result is correct. In this case, EXPLAIN f1 shows:
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node scope-33
Map Plan
f1: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-32
|
|---f1: Filter[bag] - scope-28
| |
| Equal To[boolean] - scope-31
| |
| |---Project[chararray][0] - scope-29
| |
| |---Constant(3) - scope-30
|
|---t1: New For Each(false,false,false)[bag] - scope-27
| |
| Cast[chararray] - scope-19
| |
| |---Project[bytearray][0] - scope-18
| |
| Cast[chararray] - scope-22
| |
| |---Project[bytearray][1] - scope-21
| |
| Cast[chararray] - scope-25
| |
| |---Project[bytearray][2] - scope-24
|
|---t1: Load(file:///Users/woody/test.txt:PigStorage(',')) - scope-17--------
Global sort: false
----------------
Is this a bug in Pig, or is there a good way to avoid it?
The output of pig --version on my computer is:
Apache Pig version 0.9.2 (r1232772)
compiled Jan 18 2012, 07:57:19
Upvotes: 2
Views: 4907
Reputation: 1
This should work:
t = LOAD 'test.txt' USING PigStorage(',') AS (i0:int, i1:int, i2:int);
t = FILTER t BY i0 == 3;
DUMP t;
Upvotes: 0
Reputation: 601
Interestingly, this appears to be a known issue that was closed as 'won't fix', so it is not officially treated as a bug. It is strange behavior, though, and it seems to explain some wonkiness I've experienced in the past when using the FILTER operator.
The following issue is similar and was closed as 'won't fix' in the comment stream: https://issues.apache.org/jira/browse/PIG-1341
There do appear to be subtleties in how casts are applied during loads, and this may help others: http://ofps.oreilly.com/titles/9781449302641/data_model.html#type_strength
The previous answer is spot-on; I've confirmed that it returns the expected result.
This may prompt me to cast explicitly more often in the future. Great question and answer. Sorry for not adding much, but I thought there was enough around the topic to post.
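As a rough sketch of what "casting explicitly" could look like here, the cast can also be written directly in the FILTER expression, so the comparison does not depend on a type declared earlier in the script. This assumes the same test.txt as in the question; the alias name f is only illustrative, and I have not verified it against Pig 0.9.2 specifically.

```pig
-- Load without a schema: every field arrives as a bytearray.
t = LOAD 'test.txt' USING PigStorage(',');

-- Cast the first field explicitly inside the filter condition,
-- so a chararray is compared with the chararray constant '3'.
f = FILTER t BY (chararray)$0 == '3';

DUMP f;
```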
Upvotes: 2
Reputation: 7082
As far as I know, writing this in a GENERATE only declares a type for the field; it does not actually cast the data:
GENERATE $0 AS i0:chararray
You need to cast it manually:
t1 = FOREACH t GENERATE (chararray) $0 AS i0, (chararray) $1 AS i1, (chararray) $2 AS i2;
This is counterintuitive and is probably a bug.
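Putting the manual cast back into the script from the question gives something like the sketch below. It reuses the question's aliases and file name; the DESCRIBE line is an addition of mine, included only as a quick way to inspect the declared schema of t1.

```pig
-- Load without a schema: fields are bytearrays at this point.
t = LOAD 'test.txt' USING PigStorage(',');

-- An explicit (chararray) cast converts the data, rather than only
-- declaring a type in the AS clause.
t1 = FOREACH t GENERATE (chararray)$0 AS i0, (chararray)$1 AS i1, (chararray)$2 AS i2;

-- Optional: print the schema of t1 to check the field types.
DESCRIBE t1;

f1 = FILTER t1 BY i0 == '3';
DUMP f1;
```

Running EXPLAIN f1 on this version and checking that Cast[chararray] nodes appear under the For Each, as they do in the plan for the LOAD ... AS variant in the question, is one way to confirm that the casts are really being applied.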
Upvotes: 1