harish kumar

Reputation: 45

Apache Pig: Append two data sets into one

I have two data sets.

1st set A

(111)

(222)

(555)

2nd set B

(333)

(444)

(666)

I ran C = UNION A, B;

After appending the two data sets, the output should contain the first data set followed by the second.

The expected output C is:

(111)

(222)

(555)

(333)

(444)

(666)

But my actual output C is:

(333)

(444)

(666)

(111)

(222)

(555)

When I apply UNION, the result is not in that order, so I cannot append the sets in sequence. How can I do this? I can't think of a way, so any help would be appreciated.

Upvotes: 0

Views: 2909

Answers (2)

Pranamesh

Reputation: 123

Add an extra column to each data set giving its file number, then take the union of the modified data sets and sort the result on that 'file_number' column.

A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);
-- Tag every row with the number of the file it came from.
A_mod = FOREACH A GENERATE a, 1 AS file_number;
B_mod = FOREACH B GENERATE b, 2 AS file_number;
unified_mod = UNION A_mod, B_mod;
-- Pig's sort operator is ORDER ... BY; 'output' is a reserved word, so the alias is renamed.
result = ORDER unified_mod BY file_number;
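
One caveat: ordering on file_number alone puts all rows of A before all rows of B, but Pig's ORDER BY is not stable, so the original row order inside each file is not guaranteed. A minimal sketch of one way to keep it, assuming Pig 0.11 or later (where RANK is available) and the A.txt and B.txt files from the question; the aliases row_id, value, unified and ordered are illustrative names, not part of the original answer:

```pig
-- Sketch only: assumes Pig 0.11+ and one integer per line in each file.
A = LOAD 'A.txt' USING PigStorage() AS (value:int);
B = LOAD 'B.txt' USING PigStorage() AS (value:int);

-- RANK with no BY clause prepends a sequential row number,
-- exposed as rank_<input alias> (rank_A and rank_B here).
A_ranked = RANK A;
B_ranked = RANK B;

-- Tag each row with its source file and keep its row number.
A_mod = FOREACH A_ranked GENERATE 1 AS file_number, rank_A AS row_id, value;
B_mod = FOREACH B_ranked GENERATE 2 AS file_number, rank_B AS row_id, value;

unified = UNION A_mod, B_mod;

-- Order by file first, then by position within the file.
ordered = ORDER unified BY file_number, row_id;
DUMP ordered;
```

The dumped tuples still carry the two helper columns. They can be dropped with a further FOREACH, but Pig only guarantees the sorted order when the ordered relation is dumped or stored directly, so check the result if you project afterwards.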

Upvotes: 1

Samoht-Sann

Reputation: 256

I tried a plain UNION and, for me, the data stayed in order.

But let's force the order in case it doesn't :)

As I said in the previous comment, it's not efficient, but it does the job.

-- To determine nbA (the number of rows in A), you can run this in the shell: wc -l A.txt
%default nbA 3

A = LOAD 'A.txt' USING PigStorage() AS (a:int);
B = LOAD 'B.txt' USING PigStorage() AS (b:int);

A = RANK A;
B = RANK B;

--DESCRIBE B; 
B = FOREACH B GENERATE rank_B + $nbA, $1;

C = UNION B, A;
C = ORDER C BY $0;
C = FOREACH C GENERATE $1; -- If you want to drop the first column
DUMP C;

Output :

(111)
(222)
(555)
(333)
(444)
(666)

Where :

A.txt

111
222
555

And B.txt:

333
444
666

Upvotes: 1
