Reputation: 45
I have two data sets.
First set, A:
(111)
(222)
(555)
Second set, B:
(333)
(444)
(666)
I ran C = UNION A, B;
After appending the two data sets, the output should contain the first data set followed by the second.
The expected output C is:
(111)
(222)
(555)
(333)
(444)
(666)
But my actual output C is:
(333)
(444)
(666)
(111)
(222)
(555)
When I apply UNION, the result does not keep the order of the input sets, and I am finding it difficult to append them in set order. How can I do this? I can't think of a way, so any help will be appreciated.
Upvotes: 0
Views: 2909
Reputation: 123
Add an extra column to each data set giving its file number, take the UNION of the modified data sets, and then order the result by that file_number column.
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
A_mod = FOREACH A GENERATE a, 1 AS file_number;
B_mod = FOREACH B GENERATE b, 2 AS file_number;
unified_mod = UNION A_mod, B_mod;
C_sorted = ORDER unified_mod BY file_number;
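Ordering by file_number alone puts A's rows before B's, but it says nothing about the order of rows within each file, and UNION itself does not preserve tuple order. A minimal sketch of one way to pin down both, assuming A.txt and B.txt hold one integer per line as in the question: it combines the file_number tag with a row number from RANK. The names val, row_number, unified, ordered and result are made up for illustration.

```pig
-- Assumption: one integer per line in each file, as in the question.
A = LOAD 'A.txt' USING PigStorage() AS (val:int);
B = LOAD 'B.txt' USING PigStorage() AS (val:int);

-- RANK without a BY clause prepends a sequential row number as field $0.
A_ranked = RANK A;
B_ranked = RANK B;

-- Tag every row with its source file and keep its row number.
A_mod = FOREACH A_ranked GENERATE 1 AS file_number, $0 AS row_number, val;
B_mod = FOREACH B_ranked GENERATE 2 AS file_number, $0 AS row_number, val;

unified = UNION A_mod, B_mod;

-- Sort by source file first, then by position within that file.
ordered = ORDER unified BY file_number, row_number;

-- Drop the helper columns once the rows are in order.
result = FOREACH ordered GENERATE val;
DUMP result;
```

This costs an extra sort, so it is only worth it when the final order really matters.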
Upvotes: 1
Reputation: 256
I tried the plain UNION and, for me, the data stayed in order.
But let's force the order in case it doesn't :)
As I said in the previous comment, it's not efficient, but it does the job.
-- To determine nbA (the number of rows in A), you can run this command in the shell: wc -l A.txt
%default nbA 3
A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
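-- RANK prepends a sequential row number (rank_A / rank_B) to each tuple
A = RANK A;
B = RANK B;
--DESCRIBE B;
-- Shift B's row numbers past the last row of A so that B sorts after A
B = FOREACH B GENERATE rank_B + $nbA, $1;
C = UNION B, A;
C = ORDER C BY $0;
C = FOREACH C GENERATE $1; -- If you want to drop the first column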
DUMP C;
Output:
(111)
(222)
(555)
(333)
(444)
(666)
Where:
A.txt:
111
222
555
And B.txt:
333
444
666
Upvotes: 1