rkh
rkh

Reputation: 863

pig join and average

I am trying to learn pig by myself and I have the following script:

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int, customer_id:int, rating:int); 
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
item_join = join item_data by item_id, customer_ratings by i_id;
item_group = GROUP item_join ALL;
item_foreach = foreach item_group generate item_id, item_name, item_url,  AVG(item_join.rating);
PRINT = limit item_foreach 40;
dump PRINT;

The foreach is failing with the following error:

  Invalid field projection. Projected field [item_id] does not exist in schema: group:char array,item_join:bag{:tuple(item_data::item_id:int,item_data::item_name:char array,item_data::dummy:int,item_data::item_url:chararray,customer_ratings::i_id:int,customer_ratings::customer_id:int,customer_ratings::rating:int)}.

I know there's something I didn't understand through the tutorials in order to achieve this... Any idea how to print what I have in the foreach?

I also tried generate item_data::item_id, item_data::item_name, etc. as explained in (pig - how to reference columns in a FOREACH after a JOIN?) but this also didn't work...

Upvotes: 1

Views: 1290

Answers (1)

Metropolis
Metropolis

Reputation: 2128

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int,customer_id:int, rating:int); 

item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);

item_join = foreach (
             join item_data by item_id, 
             customer_ratings by i_id
             )
            generate 
             item_data::item_id as item_id, 
             item_data::item_name as item_name,
             cutsomer_rating::rating as rating
            ;

item_group = GROUP item_join by (item_id, item_url);

item_foreach = foreach item_group generate 
                FLATTEN(group) as (item_id, item_url), 
                AVG(item_join.rating)
               ;

PRINT = limit item_foreach 40;

dump PRINT;

Something like this, I think, works. Though I haven't tested it. There are two things I did. First, after the join, I went and named the fields something simple so that we don't have to carry around a bunch of fields with names like relation.fieldname.

Flattening the group is a much easier way to get the key out of the group by. In your example, I think you'd need to use something like

generate item_join.item_data::item_id 

Upvotes: 2

Related Questions