frazman

Reputation: 33243

Very basic Pig Latin beginner code

I am new to Hadoop and all its derivatives, and I am getting intimidated by the abundance of information available.

But one thing I have realized is that to start implementing or using Hadoop or distributed code, one basically has to change the way one thinks about a problem.

I was wondering if someone could help me with the following.

So, basically (like anyone else), I have raw data. I want to parse it, extract some information, run some algorithm, and save the results.

Let's say I have a text file "foo.txt" where the data looks like:

 id,$value,garbage_field,time_string\n
  1, 200, grrrr,2012:12:2:13:00:00
  2, 12.22,jlfa,2012:12:4:15:00:00
  1, 2, ajf, 2012:12:22:13:56:00

As you can see, the id can be repeated. Think of each row as how much money a customer (the id) has spent. What I want to do is save the result in a file that contains how much money each customer has spent in the "morning", "afternoon", "evening", and "night". (You can define your own time buckets for what morning and so on mean.) For example, here the output for id 1 would probably be:

     1, 0,202,0,0 
1 is the id; $0 was spent in the morning, $202 in the afternoon, and $0 in the evening and at night.

Now, I have Python code for this, but I have to implement it in Pig to get started. If anyone can write it or guide me through it, that's all I need to get started.

Thanks

Upvotes: 1

Views: 396

Answers (1)

Cihan Keser

Reputation: 3261

I'd start like this:

foo = LOAD 'foo.txt' USING PigStorage(',') AS (
    CUSTOMER_ID:int, 
    DOLLARS_SPENT:float, 
    GARBAGE_FIELD, 
    TIME_STRING:chararray
);

foo_with_timeslots = FOREACH foo {
    GENERATE 
        CUSTOMER_ID,
        DOLLARS_SPENT,
        /* DO TIME SLOT CALCULATION HERE */ AS TIME_SLOT
    ;
}

I don't have much knowledge of date/time values in Pig, so I'll leave the conversion from time string to time slot to you.
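For what it's worth, one possible way to fill in that placeholder is sketched below. It is only a sketch under a few assumptions that are not in the question: the hour is taken to be the fourth ':'-separated field of TIME_STRING (as in "2012:12:2:13:00:00"), the bucket boundaries (night before 6, morning before 12, afternoon before 18, evening after that) are made up for illustration, and HOUR and foo_with_hour are hypothetical names. It uses the built-in REGEX_EXTRACT function and nested bincond expressions, and would stand in for the foo_with_timeslots statement above.

```pig
-- Extract the hour from TIME_STRING. The pattern allows a leading space,
-- since some rows in the sample have one after the comma.
foo_with_hour = FOREACH foo GENERATE
    CUSTOMER_ID,
    DOLLARS_SPENT,
    (int) REGEX_EXTRACT(TIME_STRING, '^ *[0-9]+:[0-9]+:[0-9]+:([0-9]+):.*', 1) AS HOUR;

-- Map the hour to a time slot. These boundaries are assumptions; adjust as needed.
foo_with_timeslots = FOREACH foo_with_hour GENERATE
    CUSTOMER_ID,
    DOLLARS_SPENT,
    (HOUR < 6  ? 'Night' :
    (HOUR < 12 ? 'Morning' :
    (HOUR < 18 ? 'Afternoon' : 'Evening'))) AS TIME_SLOT;
```

With these boundaries, both sample rows for id 1 (13:00 and 13:56) would land in 'Afternoon', which matches the 202 the question expects there.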

id_grouped_foo_with_timeslots = GROUP foo_with_timeslots BY (
    CUSTOMER_ID, 
    TIME_SLOT
);

-- Calculate how much each customer spent at time slots
spent_per_customer_per_timeslot = FOREACH id_grouped_foo_with_timeslots {
    GENERATE 
        group.CUSTOMER_ID as CUSTOMER_ID,
        group.TIME_SLOT as TIME_SLOT,
        SUM(foo_with_timeslots.DOLLARS_SPENT) as TOTAL_SPENT
    ;
}

You'll have output like the following in spent_per_customer_per_timeslot:

1,Morning,200
1,Evening,100
2,Afternoon,30

At this point it should be trivial to re-group the data and put it in the shape you want.
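A rough sketch of that re-grouping step, assuming the slot names are exactly 'Morning', 'Afternoon', 'Evening' and 'Night' (the last one comes from the question, not from the sample output above). The relation names slot_columns, by_customer and result, and the output path 'spent_by_timeslot', are hypothetical. The idea is to spread TOTAL_SPENT into one column per slot, with 0 in the other columns, and then sum each column per customer:

```pig
-- One column per time slot; slots that do not match the row contribute 0.
slot_columns = FOREACH spent_per_customer_per_timeslot GENERATE
    CUSTOMER_ID,
    (TIME_SLOT == 'Morning'   ? TOTAL_SPENT : 0.0) AS MORNING,
    (TIME_SLOT == 'Afternoon' ? TOTAL_SPENT : 0.0) AS AFTERNOON,
    (TIME_SLOT == 'Evening'   ? TOTAL_SPENT : 0.0) AS EVENING,
    (TIME_SLOT == 'Night'     ? TOTAL_SPENT : 0.0) AS NIGHT;

-- Collapse to one row per customer by summing each slot column.
by_customer = GROUP slot_columns BY CUSTOMER_ID;

result = FOREACH by_customer GENERATE
    group AS CUSTOMER_ID,
    SUM(slot_columns.MORNING)   AS MORNING,
    SUM(slot_columns.AFTERNOON) AS AFTERNOON,
    SUM(slot_columns.EVENING)   AS EVENING,
    SUM(slot_columns.NIGHT)     AS NIGHT;

-- Write comma-separated rows: id, morning, afternoon, evening, night.
STORE result INTO 'spent_by_timeslot' USING PigStorage(',');
```

Filling the non-matching slots with 0.0 rather than leaving them out means every customer gets a value in all four columns, which is the shape the question asks for.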

Upvotes: 2
