mkelly_deft
mkelly_deft

Reputation: 31

Determining membership of an int in a separate relation

This is in relation to determining if an int value from a tuple in one relation is a member value of a column from another relation in Pig Latin. I'm new to Pig Latin and finding it difficult to wrap my mind around the framework.

At the moment I have two tables, one containing a list of ids against tags with a small domain of values, and another with tuples containing an id and a tag id referring to the other table.

Here's orders.csv:

id, tag

1597, x

999, y

787, a

812, x

And tags.csv:

id, tag_id

11, 55

99, 812

22, 787

I need a method of working out if the tag_id of all tuples in the order table are a member of the subset of the ids of the tag table.

id, has_x

111, 0

99, 1

22, 0

This is what I have so far:

register 's3://bucket/jython_task.py' using jython as task;

tags = load 's3://bucket/tags.csv' USING PigStorage(',') AS (id: long, tag: chararray);

orders = load 's3://bucket/orders.csv' USING PigStorage(',') AS (id: long, tag_id: long);

tags = filter tags by tag == 'x';

x_cases = foreach tags generate tag;

tagged_orders = foreach orders generate id, tag_id, tasks.check_membership(tag_id, x_cases.tag) as is_x:int;

and the udf:

def check_membership(instance, value_list):
if instance != None:
    for value in value_list:
        if instance == value[0]:
            return 1
return 0

I get the error:

2012-09-20 23:53:45,377 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (7995), 2nd :(8028)

What am I doing wrong? is there a better way to be doing what I'm looking to do?

Upvotes: 2

Views: 458

Answers (2)

mkelly_deft
mkelly_deft

Reputation: 31

I eventually found a solution to my own problem, it involves a left outer join against the two relations and may have a more elegant solution, I'm open to any better solutions.

tags = load 's3://bucket/tags.csv' USING PigStorage(',') AS (id: long, tag: chararray);

orders = load 's3://bucket/orders.csv' USING PigStorage(',') AS (id: long, tag_id: long);

tags = filter tags by tag == 'x';

tag_cases = foreach tags generate id, 1 as found_tag:int;

tag_cases = distinct tag_cases;

example = join orders by o_id left outer tag_cases by id;

example = foreach example generate orders::o_id as id, (tag_cases is null ? 0 : 1) as has_tag;

Upvotes: 0

alexeipab
alexeipab

Reputation: 3619

I do not know what is the problem in the UDF, but you can get the result with pure PIG. Use COGROUP and IsEmpty built in function.

x_cases = cogroup orders by (tag_id), tags by (id);
tagged_orders = foreach x_cases generate flatten(orders), IsEmpty(tags);

or

tagged_orders = filter x_cases by not IsEmpty(tags);

It might not be the fastest running implementation as it uses Reduce side join, but it all depends on the volumes.

A faster approach could be to use replicated join, which will load the tags table into RAM and will use Map side join, which is faster. The bad thing is that you will lose the records that are not tagged.

tagged_orders = join orders by (tag_id), tags by (id) using 'replicated';

Upvotes: 0

Related Questions