Breakinen

Reputation: 619

How to generate a row number in Pig?

I'm using Pig for data preparation, and I've run into a problem that seems easy but that I can't solve.

For example, I have a column of names:

name
------
Alicia
Ana
Benita 
Berta 
Bertha 

How can I add a row number to each name? The result should look like this:

name    |  id
----------------
Alicia  |  1
Ana     |  2
Benita  |  3
Berta   |  4
Bertha  |  5

Thank you for reading this question!

Upvotes: 4

Views: 8604

Answers (4)

Matt Davies

Reputation: 11

@cabad

On the surface it does appear that the RANK operator would work, but you are not guaranteed a strictly increasing row id unless you place some constraints on your data.

The problem is that rows passed to the ranking operator whose ranking fields are equal share the same rank. If you can guarantee that no two rows have equal values in the fields used for ranking, this approach might work, but I'd file it under "square peg, round hole".

See this example from the docs (http://pig.apache.org/docs/r0.11.0/basic.html#rank), in which ranks 2, 6, and 10 are each shared by two rows:

C = rank A by f1 DESC, f2 ASC;

dump C;
(1,Tete,2,N)
(2,Ranjit,3,M)
(2,Ranjit,3,P)
(4,Michael,8,T)
(5,Jose,10,V)
(6,Jillian,8,Q)
(6,Jillian,8,Q)
(8,JaePak,7,Q)
(9,David,1,N)
(10,David,4,Q)
(10,David,4,Q)                

Upvotes: 1

cabad

Reputation: 4575

Pig did not have a mechanism to do this when you asked this question. However, Pig 0.11 introduced a RANK operator that can be used for this purpose.
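
A minimal sketch of how RANK might be applied to the question's data, assuming Pig 0.11 or later. The file name 'names.txt' and the aliases are hypothetical. When RANK is used without a BY clause it prepends a sequential value to each tuple, so equal names do not end up sharing a number:

```pig
-- Assumption: one name per line in a hypothetical input file 'names.txt'.
names = LOAD 'names.txt' AS (name:chararray);

-- RANK without a BY clause prepends a sequential value to each tuple.
ranked = RANK names;

-- The generated rank is the first field ($0); move it after the name and call it id.
result = FOREACH ranked GENERATE name, $0 AS id;

DUMP result;
```

The rank is referenced by position here to avoid depending on the name Pig generates for that field.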

Upvotes: 10

ihadanny

Reputation: 4483

A sketch of an idea, assuming that the "name" column we want to order by is numeric rather than a string, and that its values are evenly distributed (no skew).

  1. WITH_GROUPS = FOREACH TABLE GENERATE name, name / 100 AS group_id;
  2. GROUPED = GROUP WITH_GROUPS BY group_id;
  3. PER_GROUP = FOREACH GROUPED GENERATE group AS group_id, COUNT(WITH_GROUPS) AS cnt;
  4. ACCUM_PER_GROUP = cross join PER_GROUP with itself and, for each group, sum the counts of all groups with a smaller group_id (the cumulative count);
  5. COGROUP ACCUM_PER_GROUP and WITH_GROUPS by group_id;
  6. in the reducer, run a UDF that assigns each row an id starting from its group's cumulative count.

Upvotes: 1

Shatlyk Ashyralyyev

Reputation: 69

Unfortunately, there is no way to enumerate rows in Pig Latin. At least, I couldn't find an easy way. One solution is to implement a separate MapReduce job with a single Reduce task that does the actual enumeration. To be more precise:

Map phase: assign all rows to the same key. Single Reduce task: receives that single key with an iterator over all rows. Since the reduce task runs on just one physical machine and the reduce function is called just once, a local counter inside the function solves the problem.

If the data is too large to process on a single reduce machine, the default MapReduce Counters on the master node may be used.
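
The same single-reducer idea can be approximated without leaving Pig. The sketch below is not the separate MapReduce job described above: it swaps in GROUP ALL, which collects every row into one bag, and the Enumerate UDF from Apache DataFu, which is an assumption here (the jar must be available, and the jar path and input file name are hypothetical):

```pig
-- Assumption: the Apache DataFu jar is on hand; this path is hypothetical.
REGISTER 'datafu-pig.jar';

-- Enumerate appends an index to each tuple in a bag; '1' is the starting index.
DEFINE Enumerate datafu.pig.bags.Enumerate('1');

-- Assumption: one name per line in a hypothetical input file 'names.txt'.
names = LOAD 'names.txt' AS (name:chararray);

-- GROUP ALL puts every row into a single bag, handled by a single reducer.
all_rows = GROUP names ALL;

numbered = FOREACH all_rows {
    -- Sort inside the bag so the numbering follows name order.
    sorted = ORDER names BY name;
    GENERATE FLATTEN(Enumerate(sorted));
};

DUMP numbered;
```

As with the single Reduce task, all rows pass through one machine, so this only suits data that fits there.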

Upvotes: 3
