Reputation: 235
I have the following data (loaded in variable A):
(a1:a2:a3|a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3|c4:c5:c6|c7:c8:c9)
I want my final output to be as follows:
(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)
Here is what I did:
B = foreach A generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;
This converted the data to:
(a1:a2:a3,a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3,c4:c5:c6,c7:c8:c9)
with the following structure:
B: {splitted: chararray}
However, when I try to flatten this chararray into separate tuples, only the first item comes out. I have tried several different approaches, but I always get just the first item. Here are a couple of things I tried:
req_output = foreach B generate flatten(STRSPLIT(splitted, ','));
req_output = foreach B generate flatten(TOBAG(*));
In both cases I get the following output:
(a1:a2:a3)
(b1:b2:b3)
(c1:c2:c3)
I am not sure why this is happening. How can I get all the items as separate tuples? I do not have much experience with Pig, so any help would be appreciated.
Upvotes: 1
Views: 253
Reputation: 4724
In relation B you are keeping only the first item (i.e. the field splitted), and that is the reason for this issue. Can you remove the alias splitted from relation B? Change
B = foreach A generate flatten(STRSPLIT($0, '\\|')) as splitted:chararray;
to
B = foreach A generate flatten(STRSPLIT($0, '\\|'));
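One way to check what the alias does to the schema is to run DESCRIBE on both variants. This is only a sketch: the path 'input' and the relation names B_aliased and B_plain are placeholders chosen for illustration, and the exact DESCRIBE text may differ between Pig versions.

```pig
-- Load each line as a single chararray field (path 'input' is a placeholder)
A = LOAD 'input' USING PigStorage() AS (line:chararray);

-- Variant from the question: the AS clause declares exactly one field
B_aliased = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\|')) AS splitted:chararray;

-- Variant without the alias: no single-field schema is imposed
B_plain = FOREACH A GENERATE FLATTEN(STRSPLIT(line, '\\|'));

-- Print the schema Pig has recorded for each relation
DESCRIBE B_aliased;
DESCRIBE B_plain;
```

If the aliased relation reports a single chararray field, as in the question, that would be consistent with later statements such as TOBAG(*) seeing only the first item.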
You can solve this problem in a couple of ways.
input:
a1:a2:a3|a4:a5:a6
b1:b2:b3
c1:c2:c3|c4:c5:c6|c7:c8:c9
Option 1: Using TOKENIZE
A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line,'\\|'));
DUMP B;
Option 2: Using STRSPLIT + TOBAG
A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLIT(line,'\\|'));
C = FOREACH B GENERATE FLATTEN(TOBAG(*));
DUMP C;
Option 3: Using STRSPLITTOBAG (available from Pig version 0.14)
A = LOAD 'input' USING PigStorage() AS(line:chararray);
B = FOREACH A GENERATE FLATTEN(STRSPLITTOBAG(line,'\\|'));
DUMP B;
Output:
(a1:a2:a3)
(a4:a5:a6)
(b1:b2:b3)
(c1:c2:c3)
(c4:c5:c6)
(c7:c8:c9)
Upvotes: 3