Reputation: 10314
Suppose that I have a list of couples (id, value)
and a list of potentialIDs
.
For each of the potentialIDs
I wanto to count how many times that ID
appears in the first list.
E.g.
couples:
1 a
1 x
2 y
potentialIDs
1
2
3
Result:
1 2
2 1
3 0
I'm trying to do that in PigLatin
but it doesn't seem trivial.
Could you give me any hints?
Upvotes: 0
Views: 108
Reputation: 5186
The general plan is: you can group couples by id and do a COUNT
, then do a left join on potentialIDs and the output from the COUNT
. From there you can format it as you need. The code should explain how to do this in more detail.
NOTE: If you need me to go into more detail just let me know, but I think the comments should explain what is going on pretty well.
-- B generates the count of the number of occurrences of an id in couple
B = FOREACH (GROUP couples BY id)
-- Output and schema of the group is:
-- {group: chararray,couples: {(id: chararray,value: chararray)}}
-- (1,{(1,a),(1,x)})
-- (2,{(2,y)})
-- COUNT(couples) counts the number of tuples in the bag
GENERATE group AS id, COUNT(couples) AS count ;
-- Now we want to do a LEFT join on potentialIDs and B since it will
-- create nulls for IDs that appear in potentialIDs, but not in B
C = FOREACH (JOIN potentialIDs BY id LEFT, B BY id)
-- The output and schema for the join is:
-- {potentialIDs::id: chararray,B::id: chararray,B::count: long}
-- (1,1,2)
-- (2,2,1)
-- (3,,)
-- Now we pull out only one ID, and convert any NULLs in count to 0s
GENERATE potentialIDs::id, (B::count is NULL?0:B::count) AS count ;
The schema and output for C
is:
C: {potentialIDs::id: chararray,count: long}
(1,2)
(2,1)
(3,0)
If you don't want the disambiguate operator (the ::) in C
, you can just change the GENERATE
line to:
GENERATE potentialIDs::id AS id, (B::count is NULL?0:B::count) AS count ;
Upvotes: 1