barryred

Reputation: 1123

PIG Latin statement very slow

I'm running a PIG script, and it all goes very quickly, until I get to the FOREACH ... GENERATE FLATTEN(...) line.

Is there a reason that line should run so slowly? (It causes the entire script to time out on a fairly powerful cluster.)

extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}

-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}

-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!

-- Remove duplicates
result = DISTINCT flattened;

Thanks, Barry

Upvotes: 1

Views: 758

Answers (1)

alexeipab

Reputation: 3619

When two FLATTEN(...) operators appear in the same GENERATE, you get the Cartesian product of the two bags. So if the bag produced by the GROUP has N elements, two FLATTEN(...) operators on that same bag generate N*N rows for each group, which can heavily tax CPU, disk and network. See the following example:

CODE:

inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);
flt = foreach grp generate FLATTEN(inpt.c1), FLATTEN(inpt.c2);

INPUT:

1       a
1       a
1       b
1       b
1       c

OUTPUT:

(1,a)
(1,a)
(1,a)
(1,a)
(1,b)
(1,b)
(1,b)
(1,b)
(1,c)

See how the 2 records of (1,a) and the 2 records of (1,b) each produced 4 output records (2*2), while the single (1,c) record produced just 1 output record.
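
One possible way to avoid the cross product, sketched against the example above: flatten the bag once as a projection of both columns, or, when only the distinct key pairs are needed (as in the question, which runs DISTINCT afterwards), flatten the group key itself. The relation names flt_once and flt_keys are illustrative and do not come from the original posts.

```pig
inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);

-- Single FLATTEN over a two-column projection of the bag:
-- N rows per group instead of N*N.
flt_once = foreach grp generate FLATTEN(inpt.(c1, c2));

-- If only the distinct (c1, c2) pairs are wanted, flatten the group key:
-- one row per group, so no DISTINCT pass is needed afterwards.
flt_keys = foreach grp generate FLATTEN(group) as (c1, c2);
```

Applied to the question's script, the second form would read `result = FOREACH grouped GENERATE FLATTEN(group) AS (query_norm, url);`, which should yield the same rows as the original FLATTEN followed by DISTINCT, since every row in a group shares the same (query_norm, url) pair.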

Upvotes: 2
