Aslan986
Aslan986

Reputation: 10314

Counting task in Pig Latin

Suppose that I have a list of couples (id, value) and a list of potentialIDs.

For each of the potentialIDs I wanto to count how many times that ID appears in the first list.

E.g.

couples:
1 a
1 x
2 y

potentialIDs
1
2
3

Result:
1 2
2 1
3 0

I'm trying to do that in PigLatin but it doesn't seem trivial.

Could you give me any hints?

Upvotes: 0

Views: 108

Answers (1)

mr2ert
mr2ert

Reputation: 5186

The general plan is: you can group couples by id and do a COUNT, then do a left join on potentialIDs and the output from the COUNT. From there you can format it as you need. The code should explain how to do this in more detail.

NOTE: If you need me to go into more detail just let me know, but I think the comments should explain what is going on pretty well.

-- B generates the count of the number of occurrences of an id in couple
B = FOREACH (GROUP couples BY id) 
    -- Output and schema of the group is:
    -- {group: chararray,couples: {(id: chararray,value: chararray)}}
    -- (1,{(1,a),(1,x)})
    -- (2,{(2,y)})

    -- COUNT(couples) counts the number of tuples in the bag
    GENERATE group AS id, COUNT(couples) AS count ;

-- Now we want to do a LEFT join on potentialIDs and B since it will
-- create nulls for IDs that appear in potentialIDs, but not in B
C = FOREACH (JOIN potentialIDs BY id LEFT, B BY id) 
    -- The output and schema for the join is:
    -- {potentialIDs::id: chararray,B::id: chararray,B::count: long}
    -- (1,1,2)
    -- (2,2,1)
    -- (3,,)

    -- Now we pull out only one ID, and convert any NULLs in count to 0s
    GENERATE potentialIDs::id, (B::count is NULL?0:B::count) AS count ;

The schema and output for C is:

C: {potentialIDs::id: chararray,count: long}
(1,2)
(2,1)
(3,0)

If you don't want the disambiguate operator (the ::) in C, you can just change the GENERATE line to:

GENERATE potentialIDs::id AS id, (B::count is NULL?0:B::count) AS count ;

Upvotes: 1

Related Questions