Reputation: 65
I have two HBase input aliases:
X:
(a1,b1)
(a2,b2)
...
(an,bn)
Y:
(c1)
(c2)
...
(cn)
Now I want to "join" both aliases by position: the first row of X with the first row of Y, the second with the second, and so on. The final result should be:
RESULT:
(a1,b1,c1)
(a2,b2,c2)
...
(an,bn,cn)
How can I do that?
Upvotes: 2
Views: 890
Reputation: 5186
If you are using Pig 0.11 or later, you could try the RANK operator. Something like:
XR = RANK X ;
YR = RANK Y ;
RESULT = JOIN XR BY $0, YR BY $0 ;
If you just do RANK X, each row gets a unique sequential number, which is prepended as the first field ($0). If you rank by a field instead, with something like RANK X BY $0 DESC, rows with equal values share the same rank, so the numbers are no longer guaranteed to be unique.
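Note that the joined relation still carries both rank columns. A minimal sketch of dropping them, assuming X has exactly two fields and Y has one, as in the question (the alias J and the positional references are only for illustration):

```pig
-- Assumes X = (a, b) and Y = (c), as in the question.
XR = RANK X;                   -- (rank, a, b)
YR = RANK Y;                   -- (rank, c)
J  = JOIN XR BY $0, YR BY $0;  -- (rank, a, b, rank, c)
-- Keep only the original fields and drop both rank columns.
RESULT = FOREACH J GENERATE $1, $2, $4;
```

JOIN does not promise any particular output order, so if the rows must come out in their original sequence, keep $0 in the projection and ORDER BY it before storing.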
Upvotes: 1
Reputation: 5801
A basic tenet of Pig is that order does not matter. More generally, a relation is a set of tuples, not a list of tuples. If order is important to your data, that should be reflected in the data itself, not by the manner in which it happens to be stored.
Nevertheless, a workaround does exist if you can guarantee that Pig will process your data in the order you want when you load it. Use the Enumerate UDF from DataFu, which appends an index field i to each tuple in a bag:
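-- The DataFu jar must be registered first (REGISTER with the path to your DataFu jar).
DEFINE Enumerate datafu.pig.bags.Enumerate('1');
Xenum = FOREACH (GROUP X ALL) GENERATE FLATTEN(Enumerate(X));
Yenum = FOREACH (GROUP Y ALL) GENERATE FLATTEN(Enumerate(Y));
RESULT = FOREACH (JOIN Xenum BY i, Yenum BY i) GENERATE a, b, c;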
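If the two HBase tables happen to share row keys (an assumption, since the question does not say so), the pairing is already part of the data and a plain key join would avoid relying on load order at all. A rough sketch, where the table names table_x and table_y, the column family cf, and the field name id are all hypothetical:

```pig
-- Hypothetical table and column names.
-- '-loadKey true' makes HBaseStorage return the row key as the first field.
X = LOAD 'hbase://table_x'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:a cf:b', '-loadKey true')
    AS (id:chararray, a:chararray, b:chararray);
Y = LOAD 'hbase://table_y'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('cf:c', '-loadKey true')
    AS (id:chararray, c:chararray);
-- Join on the shared row key instead of on position.
RESULT = FOREACH (JOIN X BY id, Y BY id) GENERATE a, b, c;
```

Unlike the Enumerate approach, this does not need GROUP ... ALL, so the data is not funneled through a single bag.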
Upvotes: 1