Reputation: 619
I'm using Pig for data preparation, and I've run into a problem that seems easy but that I can't solve:
for example, I have a column of names:
name
------
Alicia
Ana
Benita
Berta
Bertha
How can I add a row number to each name? The result would look like this:
name | id
----------------
Alicia | 1
Ana | 2
Benita | 3
Berta | 4
Bertha | 5
Thank you for reading this question!
Upvotes: 4
Views: 8604
Reputation: 11
@cabad
On the surface it does appear that the RANK operator would work, but you are not guaranteed an increasing row id without placing some constraints on your data.
The problem is that any rows provided to the ranking operator that compare equal share the same rank. If you can guarantee that no two rows have equal values in the fields used for ranking, then this approach might work, but I'd still call it a "square peg, round hole" approach.
See this example from the [docs](http://pig.apache.org/docs/r0.11.0/basic.html#rank) (note the tied ranks 2, 6, and 10):
C = rank A by f1 DESC, f2 ASC;
dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(4,Michael,8,T)
(5,Jose,10,V)
(6,Jillian,8,Q)
(6,Jillian,8,Q)
(8,JaePak,7,Q)
(9,David,1,N)
(10,David,4,Q)
(10,David,4,Q)
Upvotes: 1
Reputation: 4575
Pig did not have a mechanism to do this when you asked this question. However, Pig 0.11 introduced a RANK operator that can be used for this purpose.
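As a minimal sketch (assuming Pig 0.11+; the file name and field names here are hypothetical), RANK used *without* a BY clause assigns each tuple a unique, sequential id, which is exactly the row-number behavior asked for. The prepended rank field is named after the relation alias (here `rank_names`):

```
-- Hypothetical input: one chararray column of names
names  = LOAD 'names.txt' AS (name:chararray);

-- RANK with no BY clause prepends a unique sequential id to each tuple
ranked = RANK names;

-- Reorder the fields to get (name, id)
result = FOREACH ranked GENERATE name, rank_names AS id;
DUMP result;
```

Note that the ties shown in the comment above only arise when you use `RANK ... BY` on fields with duplicate values; the plain `RANK` form does not deduplicate.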
Upvotes: 10
Reputation: 4483
A sketch of an idea, assuming that the "name" column we want to order by is numeric rather than a string, and that its distribution is reasonably non-skewed.
Upvotes: 1
Reputation: 69
Unfortunately, there is no way to enumerate rows in Pig Latin. At least, I couldn't find an easy way. One solution is to implement a separate MapReduce job with a single reduce task that does the actual enumeration. More precisely:
Map phase: assign all rows to the same key. Reduce phase: the single reduce task receives that one key with an iterator over all rows. Since the reduce task runs on just one physical machine and the reduce function is called just once, a local counter inside the function solves the problem.
If the data is too large to process on a single reduce machine, then the default MapReduce counters on the master node can be used instead.
Upvotes: 3