t k

Reputation: 65

Merging two datasets in Apache Pig

I have two HBase input aliases:

X:
(a1,b1)
(a2,b2)
...
(an,bn)

Y:
(c1)
(c2)
...
(cn)

Now I want to "join" both aliases by position: the first line from X with the first line from Y, the second with the second, and so on. The final result should be:

RESULT:
(a1,b1,c1)
(a2,b2,c2)
...
(an,bn,cn)

How can I do that?

Upvotes: 2

Views: 890

Answers (2)

mr2ert

Reputation: 5186

If you are using Pig 0.11 or later, you could try the RANK operator. Something like:

XR = RANK X ;
YR = RANK Y ;
RESULT = JOIN XR BY $0, YR BY $0 ;

If you just do RANK X, each line gets a unique sequential number. If you do something like RANK X BY $0 DESC, the rank is based on the field value instead, so lines that tie on $0 share a rank and the numbers are no longer guaranteed to be unique.
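
The joined relation above still carries both rank columns, so a final projection is needed to get the (a,b,c) shape from the question. A minimal sketch of that last step, assuming RANK prepends the rank as the first field of each relation (named rank_X and rank_Y here) and using positional references because the question does not give field names for X and Y:

```
XR = RANK X;                            -- (rank_X, a, b)
YR = RANK Y;                            -- (rank_Y, c)
J  = JOIN XR BY rank_X, YR BY rank_Y;   -- (rank_X, a, b, rank_Y, c)
RESULT = FOREACH J GENERATE $1, $2, $4; -- keep a, b, c; drop both ranks
```

This only lines up the rows correctly if both inputs are loaded in the order you expect, since the rank follows the order in which Pig sees the tuples.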

Upvotes: 1

reo katoa

Reputation: 5801

A basic tenet of Pig is that order does not matter. More generally, a relation is a set of tuples, not a list of tuples. If order is important to your data, that should be reflected in the data itself, not by the manner in which it happens to be stored.

Nevertheless, a workaround does exist if you can guarantee that when you load your data Pig will process it in the order you want. Use the Enumerate UDF from DataFu:

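-- Assumes the DataFu jar has already been REGISTERed; '1' is the starting index.
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
-- Enumerate appends an index field i to every tuple in the bag.
Xenum = FOREACH (GROUP X ALL) GENERATE FLATTEN(Enumerate(X));
Yenum = FOREACH (GROUP Y ALL) GENERATE FLATTEN(Enumerate(Y));
RESULT = FOREACH (JOIN Xenum BY i, Yenum BY i) GENERATE a, b, c;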

Upvotes: 1
