Alexandro Giles
Alexandro Giles

Reputation: 492

Use REPLACE in multicondition RegExp in Pig Latin

I have this set of data:

dump data;

This is a sample output: (this dataset is almost a million rows long).

("0",60,0,1,"Fri")
("1",47,0,1,"Mon")
("1",23,1,0,"Tue")
("1",60,0,0,"Sat")
("1",50,1,1,"Fri")

I want to replace the values: Sat, Fri, Mon to numbers of week, I know how to use REPLACE for change just 1 value at a time, but I have to repeat it multiple times in order to change all days of the week:

data_day_of_week = FOREACH data GENERATE $0,$1,$2,$3,REPLACE($4, 'Mon', '1');

Is there any way to do this in only one statement?

Upvotes: 2

Views: 89

Answers (4)

Alexandro Giles
Alexandro Giles

Reputation: 492

saph_top, was the closer one to answer my question, nonetheless after testing it was resolving blank output, I'm going to complement his answer:

'Mon' is not the same as "Mon", therefore when I was using: CASE WHEN $4 == 'Mon' THEN '1' It wasn´t replacing anything, resulting in blank result in: data_day_of_week.

To solve this, I just add " " (double quotes to the condition):

data_day_of_week = FOREACH data GENERATE
CASE
     WHEN $4 == '"Mon"' THEN '1'
     WHEN $4 == '"Tue"' THEN '2'
     ...
     WHEN $4 == '"Sun"' THEN '7'
END AS day_number;

After that in order to rebuild the data I add the following to GENERATE Clause:

data_day_of_week = FOREACH data GENERATE $0,$1,$2,$3,
    CASE
         WHEN $4 == '"Mon"' THEN '1'
         WHEN $4 == '"Tue"' THEN '2'
         ...
         WHEN $4 == '"Sun"' THEN '7'
    END AS day_number;

And the output was complete now: dump data_day_of_week;

("0",60,0,1,5)
("1",47,0,1,1)
("1",23,1,0,2)
("1",60,0,0,6)
("1",50,1,1,5)

Upvotes: 0

pauljcg
pauljcg

Reputation: 193

You could combine the Pig ToDate and ToString functions:

data_day_of_week = FOREACH data GENERATE $0,$1,$2,$3,
    ToString(ToDate($4, 'EEE'), 'e') as day_of_week;

The ToDate functions will convert the chararray day of the week to a Pig date time format. Then ToString will convert this to a format of your choosing.

According to the Java docs a single e or c should give the numeric format for the day of the week, where Monday is 1.

Upvotes: 1

Clover
Clover

Reputation: 19

You can use JOIN with a reference to a mapping like this:

(Mon,1)
(Tue,2)
(Wed,3)
(Thu,4)
(Fri,5)
(Sat,6)
(Sun,7)

Join statement:

outer_left = join your_data by $4 left outer, day_mapping by day;

Upvotes: 0

saph_top
saph_top

Reputation: 677

You should use a CASE WHEN THEN statement

data_day_of_week = FOREACH data GENERATE
    CASE
         WHEN $4 == 'Mon' THEN '1'
         WHEN $4 == 'Tue' THEN '2'
         ...
         WHEN $4 == 'Sun' THEN '7'
    END AS day_number;

You should also name your relations so not to use $1, $2 etc. If you name $4 as day_number then when you declare the variable as day_number from the CASE statement it'll "overwrite" your prior data.

Upvotes: 0

Related Questions