Reputation: 33243
I am new to Hadoop and all its derivatives, and I am getting intimidated by the sheer amount of information available.
One thing I have realized is that to start using Hadoop or writing distributed code, you basically have to change the way you think about a problem.
I was wondering if someone could help me with the following.
Like anyone else, I have some raw data. I want to parse it, extract some information, run an algorithm on it, and save the results.
Let's say I have a text file "foo.txt" where the data looks like this:
id,$value,garbage_field,time_string\n
1, 200, grrrr,2012:12:2:13:00:00
2, 12.22,jlfa,2012:12:4:15:00:00
1, 2, ajf, 2012:12:22:13:56:00
As you can see, the id can be repeated. Think of the id as a customer and the value as how much money that customer spent. What I want to do is save the result in a file that contains how much money each customer spent in the "morning", "afternoon", "evening", and "night" (you can define your own time buckets for what counts as morning and so on). For example, for id 1 above the output would probably be:
1, 0,202,0,0
Here 1 is the id, followed by $0 spent in the morning, $202 in the afternoon, and $0 in both the evening and the night.
I already have Python code for this, but I have to implement it in Pig to get started. If anyone can write it or guide me through it, that's all I need to get going.
Thanks
Upvotes: 1
Views: 396
Reputation: 3261
I'd start like this:
foo = LOAD 'foo.txt' USING PigStorage(',') AS (
    CUSTOMER_ID:int,
    DOLLARS_SPENT:float,
    GARBAGE_FIELD,
    TIME_STRING:chararray
);
foo_with_timeslots = FOREACH foo {
    GENERATE
        CUSTOMER_ID,
        DOLLARS_SPENT,
        /* DO TIME SLOT CALCULATION HERE */ AS TIME_SLOT
    ;
}
I don't know much about date/time handling in Pig, so I'll leave the conversion from time string to time slot to you.
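For what it's worth, the bucketing rule itself is small. Here is a rough sketch in plain Python (not Pig), just to pin down the logic. The function name time_slot and the bucket boundaries are assumptions of mine, since the question leaves them open, and it also assumes the hour is the fourth colon-separated field of the time string.

```python
def time_slot(time_string):
    """Map a 'YYYY:MM:DD:HH:MM:SS' string to a named time slot.

    Assumed boundaries (not from the question):
    Morning 06-11, Afternoon 12-17, Evening 18-21, Night otherwise.
    """
    # strip() handles the stray space before some time strings in foo.txt
    hour = int(time_string.strip().split(":")[3])
    if 6 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    if 18 <= hour < 22:
        return "Evening"
    return "Night"
```

In Pig the same rule could probably be written as a conditional on the extracted hour, or wrapped in a UDF; which is cleaner depends on your Pig version.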
id_grouped_foo_with_timeslots = GROUP foo_with_timeslots BY (
    CUSTOMER_ID,
    TIME_SLOT
);
-- Calculate how much each customer spent at time slots
spent_per_customer_per_timeslot = FOREACH id_grouped_foo_with_timeslots {
    GENERATE
        group.CUSTOMER_ID as CUSTOMER_ID,
        group.TIME_SLOT as TIME_SLOT,
        SUM(foo_with_timeslots.DOLLARS_SPENT) as TOTAL_SPENT
    ;
}
You'll then have output like the following in spent_per_customer_per_timeslot (values are illustrative):
1,Morning,200
1,Evening,100
2,Afternoon,30
At this point it should be trivial to re-group the data and put it in the shape you want.
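As a sketch of that final reshaping step, in plain Python rather than Pig: it takes (customer, slot, total) rows like the ones above and pivots them into one row per customer with a fixed column order. The names pivot_by_customer and SLOTS are made up for illustration, and the column order (morning, afternoon, evening, night) is taken from the question.

```python
from collections import defaultdict

# Assumed column order for the output row, following the question.
SLOTS = ["Morning", "Afternoon", "Evening", "Night"]

def pivot_by_customer(rows):
    """Turn (customer_id, slot, total) rows into {customer_id: [totals in SLOTS order]}."""
    # Every customer starts with 0 spent in each slot.
    table = defaultdict(lambda: [0.0] * len(SLOTS))
    for customer_id, slot, total in rows:
        table[customer_id][SLOTS.index(slot)] += total
    return dict(table)

rows = [(1, "Morning", 200), (1, "Evening", 100), (2, "Afternoon", 30)]
for customer_id, totals in sorted(pivot_by_customer(rows).items()):
    # One comma-separated line per customer: id, then one total per slot.
    print(customer_id, *totals, sep=",")
```

Slots a customer never spent in stay at zero, which matches the zero-filled row the question asks for.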
Upvotes: 2