Reputation: 2164
In Pig, what is the effective way to get a count? We can do a GROUP ALL, but that gives us only one reducer. When the data size is very large, say n terabytes, can we try multiple reducers somehow?
dataCount = FOREACH (GROUP data ALL) GENERATE
    'count' AS metric,
    COUNT(data) AS value;
Upvotes: 1
Views: 10784
Reputation: 2485
I just dug a bit more into this topic, and it seems you don't have to be afraid that a single reducer will have to process an enormous amount of data if you're using an up-to-date Pig version. Algebraic UDFs handle COUNT smartly: the partial counts are calculated on the mappers, so the reducer only has to deal with the aggregated data (one count per mapper). I think this was introduced in 0.9.1, but 0.14.0 definitely has it.
Algebraic Interface
An aggregate function is an eval function that takes a bag and returns a scalar value. One interesting and useful property of many aggregate functions is that they can be computed incrementally in a distributed fashion. We call these functions algebraic. COUNT is an example of an algebraic function because we can count the number of elements in a subset of the data and then sum the counts to produce a final output. In the Hadoop world, this means that the partial computations can be done by the map and combiner, and the final result can be computed by the reducer.
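To illustrate, here is a minimal sketch (assuming a semicolon-delimited input file called 'data', like in the other answer): a plain GROUP ALL count already benefits from this, because the combiner emits one partial count per mapper and the single reducer merely sums those partials.
-- COUNT is algebraic, so the heavy lifting happens in the map/combine
-- phases; the single reducer only sums the per-mapper partial counts
data  = load 'data' using PigStorage(';');
total = foreach (group data all) generate COUNT(data);
dump total;
You can check this with EXPLAIN total; — the plan includes a combine phase that runs COUNT's intermediate stage, confirming the pre-aggregation.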
But my previous answer was definitely wrong:
In the grouping you can use the PARALLEL n keyword, which sets the number of reducers. From the Pig docs: "Increase the parallelism of a job by specifying the number of reduce tasks, n. The default value for n is 1 (one reduce task)."
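For reference, a hypothetical sketch of the syntax (relation names made up), with a note on why it doesn't help a GROUP ALL:
-- PARALLEL sets the number of reduce tasks for this GROUP's job...
B = group A by $0 parallel 10;
-- ...but it can't help a GROUP ... ALL: everything maps to a single key,
-- so one reducer would still receive all the data for that key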
Upvotes: -1
Reputation: 3570
Instead of directly using a GROUP ALL, you could divide it into two steps. First, group by some field and count the number of rows. Then, perform a GROUP ALL to sum all of those counts. This way, you would be able to count the number of rows in parallel.
Note, however, that if the field you use in the first GROUP BY does not have duplicates, the resulting counts will all be 1, so there won't be any difference. Try using a field that has many duplicates to improve performance.
See this example:
a;1
a;2
b;3
b;4
b;5
If we first group by the first field, which has duplicates, the final GROUP ALL (the SUM) will deal with 2 rows instead of 5:
A = load 'data' using PigStorage(';');
B = group A by $0;                  -- two groups: 'a' and 'b'
C = foreach B generate COUNT(A);    -- one count per group, computed in parallel
dump C;
(2)
(3)
D = group C all;
E = foreach D generate SUM(C.$0);   -- the final reducer only sums 2 partial counts
dump E;
(5)
However, if we group by the second one, which is unique, it will deal with 5 rows:
A = load 'data' using PigStorage(';');
B = group A by $1;                  -- $1 is unique, so 5 groups of one row each
C = foreach B generate COUNT(A);
dump C;
(1)
(1)
(1)
(1)
(1)
D = group C all;
E = foreach D generate SUM(C.$0);
dump E;
(5)
Upvotes: 7