Reputation: 9828
I need the select the top x% rows of a table in Pig. Could someone tell me how to do it without writing a UDF?
Thanks!
Upvotes: 0
Views: 1727
Reputation: 1013
I won't write the pig code as it will take a while to write and test, but I would do it like this (if you need the exact solution, if not, there are simpler methods):
Count the data points for each quantile.
At this point the min point of the top 10% will fall into one of these intervals. Find this interval (this is easy as the counts will tell you exactly where it is), and using the sum of the counts of the larger quantiles together with the relevant quantile find the 10% point in this interval.
Go over your data again and filter out everything but the points larger than the one you just found.
Portions of this might require UDFs.
Upvotes: 0
Reputation: 3261
As mentioned before, first you need to count the number of rows in your table and then obviously you can do:
A = load 'X' as (row);
B = group A all;
C = foreach B generate COUNT(A) as count;
D = LIMIT A C.count/10; --you might need a cast to integer here
The catch is that, dynamic argument support for LIMIT
function was introduced in Pig 0.10. If you're working with a previous version, then a suggestion is offered here using the TOP
function.
Upvotes: 3
Reputation: 30089
Not sure how you would go about pulling a percentage, but if you know your table size is 100 rows, you can use the LIMIT command to get the top 10% for example:
A = load 'myfile' as (t, u, v);
B = order A by t;
C = limit B 10;
(Above example adapted from http://pig.apache.org/docs/r0.7.0/cookbook.html#Use+the+LIMIT+Operator)
As for dynamically limiting to 10%, not sure you can do this without knowing how 'big' the table is, and i'm pretty sure you couldn't do this in a UDF, you'd need to run a job to count the number of rows, then another job to do the LIMIT query.
Upvotes: 0