Reputation: 1123
I'm running a Pig script, and it all goes very quickly until I get to the FOREACH ... GENERATE FLATTEN(...) line.
Is there a reason that line should run so slowly? (It causes the entire script to time out on a fairly powerful cluster.)
extended = FOREACH kRecords GENERATE *, NORMALIZE(query) AS query_norm:chararray;
-- DESCRIBE extended;
-- extended: {query: chararray,url: chararray,query_norm: chararray}
-- GROUP by both query and url
grouped = GROUP extended BY (query_norm, url);
-- DESCRIBE grouped;
-- grouped: {group: (query_norm: chararray,url: chararray),extended: {(query: chararray,url: chararray,query_norm: chararray)}}
-- Remove multiple items per record (but at the expense of duplicating records)
-- THE LINE BELOW IS THE SLOW ONE!!!
flattened = FOREACH grouped GENERATE FLATTEN(extended.query_norm), FLATTEN(extended.url);
-- THE LINE ABOVE IS THE SLOW ONE!!!
-- Remove duplicates
result = DISTINCT flattened;
Thanks, Barry
Upvotes: 1
Reputation: 3619
When two FLATTEN(...) operators are used together in the same GENERATE, you get the Cartesian product of the two bags. So if a bag produced by the GROUP has N elements, applying two FLATTEN(...) operators to that same bag generates N*N rows for each group, which can heavily tax CPU, disk, and network. See the following example:
CODE:
inpt = load '/pig_fun/input/group.txt' as (c1, c2);
grp = group inpt by (c1, c2);
flt = foreach grp generate FLATTEN(inpt.c1), FLATTEN(inpt.c2);
INPUT:
1 a
1 a
1 b
1 b
1 c
OUTPUT:
(1,a)
(1,a)
(1,a)
(1,a)
(1,b)
(1,b)
(1,b)
(1,b)
(1,c)
Note how the 2 records of (1,a) and the 2 records of (1,b) each produced 4 output records (2*2), while the single record of (1,c) produced just 1 output record.
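If the goal is only the distinct (query_norm, url) pairs, one possible way to avoid the cross product is to flatten the group key rather than the bag, since GROUP already emits one row per distinct key. This is a minimal sketch, assuming the `extended` and `grouped` relations from the question. The aliases `pairs` and `result_alt` are made up here for illustration:

```pig
-- Each row of grouped carries a unique (query_norm, url) key, so
-- flattening the key gives one row per pair with no N*N blow-up.
result = FOREACH grouped GENERATE FLATTEN(group) AS (query_norm:chararray, url:chararray);

-- Alternative: skip the GROUP and project the two fields directly,
-- then deduplicate. (pairs and result_alt are hypothetical aliases.)
pairs = FOREACH extended GENERATE query_norm, url;
result_alt = DISTINCT pairs;
```

Both versions should yield the same set of pairs as the original script, without generating the intermediate duplicates that DISTINCT then has to remove.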
Upvotes: 2