Reputation: 6029
I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. Sure, I could just brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.
Upvotes: 24
Views: 42777
Reputation: 221106
MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:
SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col
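For reference, PERCENT_RANK() is defined as (rank - 1) / (rows - 1), where rank is the usual RANK() and tied values share the lowest position. Below is a minimal Python sketch of that definition, which may be handy for cross-checking the query on a small sample; the function name `percent_rank` is only illustrative and is not part of any library.

```python
def percent_rank(values):
    """Return (rank - 1) / (n - 1) for each value, in input order.

    Mirrors the definition of SQL PERCENT_RANK() OVER (ORDER BY col):
    tied values share the rank of their first position in sort order.
    """
    n = len(values)
    if n <= 1:
        # A single row (or none) has a percent rank of 0 by definition.
        return [0.0] * n
    first_pos = {}
    for pos, v in enumerate(sorted(values)):
        # Zero-based position of the first occurrence equals rank - 1.
        first_pos.setdefault(v, pos)
    return [first_pos[v] / (n - 1) for v in values]


if __name__ == "__main__":
    print(percent_rank([10, 20, 20, 30]))
```

Sorting dominates the cost, so this runs in O(n log n) rather than comparing every row with every other row.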
Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RDBMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, you can emulate PERCENTILE_DISC, however, again using the PERCENT_RANK and FIRST_VALUE window functions.
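To make the difference between a percent rank and a percentile concrete, here is a small Python sketch of what a discrete percentile returns, assuming the usual definition: the first value in sort order whose cumulative distribution reaches the requested fraction. It illustrates the semantics only and is not the MySQL emulation itself; the name `percentile_disc` is borrowed from the SQL function purely for readability.

```python
def percentile_disc(values, p):
    """Return the first value in ascending order whose cumulative
    distribution (position / row count) is at least p, with 0 <= p <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    for position, value in enumerate(ordered, start=1):
        # position / n is the share of rows at or before this position.
        if position / n >= p:
            return value
    return ordered[-1]  # not reached for valid p; kept as a safe fallback


if __name__ == "__main__":
    print(percentile_disc([4, 1, 3, 2], 0.5))
```

A percent rank maps each value to a fraction, while a percentile goes the other way and maps a fraction back to one of the observed values.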
Upvotes: 3
Reputation: 1
Suppose we have a sales table like:
user_id, units
Then the following query gives the percentile of each user:
select a.user_id, a.units,
(sum(case when a.units >= b.units then 1 else 0 end) * 100) / count(*) as percentile
from sales a cross join sales b
group by a.user_id, a.units;
Note that this is a cross join, so it has O(n²) complexity and should be considered an unoptimized solution, but it is simple given that our MySQL version has no built-in function for this.
Upvotes: 0
Reputation: 494
SELECT
c.id, c.score, ROUND(((@rank - rank) / @rank) * 100, 2) AS percentile_rank
FROM
(SELECT
*,
@prev:=@curr,
@curr:=a.score,
@rank:=IF(@prev = @curr, @rank, @rank + 1) AS rank
FROM
(SELECT id, score FROM mytable) AS a,
(SELECT @curr:= null, @prev:= null, @rank:= 0) AS b
ORDER BY score DESC) AS c;
Upvotes: 6
Reputation: 21
If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks down excess flight block times into an airport into their percentiles. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting):
$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort= 'JFK';
$strSQL = "SELECT COUNT(*) as TotFlights FROM FIDS where depdt >= '$startDt' And depdt <= '$endDt' and ArrPort='$arrPort'";
if (!($queryResult = mysql_query($strSQL, $con)) ) {
echo $strSQL . " FAILED\n"; echo mysql_error();
exit(0);
}
$totFlights=0;
while($fltRow=mysql_fetch_array($queryResult)) {
echo "Total Flights into " . $arrPort . " = " . $fltRow['TotFlights'];
$totFlights = $fltRow['TotFlights'];
/* 1906 flights. Percentile 90 = int(0.9 * 1906). */
for ($x = 1; $x<=10; $x++) {
$pctlPosn = $totFlights - intval( ($x/10) * $totFlights);
echo "PCTL POSN for " . $x * 10 . " IS " . $pctlPosn . "\t";
$pctlSQL = "SELECT (ablk-sblk) as ExcessBlk from FIDS where ArrPort='" . $arrPort . "' order by ExcessBlk DESC limit " . $pctlPosn . ",1;";
if (!($query2Result = mysql_query($pctlSQL, $con)) ) {
echo $pctlSQL . " FAILED\n";
echo mysql_error();
exit(0);
}
while ($pctlRow = mysql_fetch_array($query2Result)) {
echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
}
}
}
Upvotes: 2
Reputation: 9312
Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)
In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):
SELECT
id,
@prev := @curr as prev,
@curr := measure as curr,
@rank := IF(@prev > @curr, @rank+@ties, @rank) AS rank,
@ties := IF(@prev = @curr, @ties+1, 1) AS ties,
(1-@rank/@total) as percentrank
FROM
mytable,
(SELECT
@curr := null,
@prev := null,
@rank := 0,
@ties := 1,
@total := count(*) from mytable where measure is not null
) b
WHERE
measure is not null
ORDER BY
measure DESC
Credit for this method goes to Shlomi Noach. He writes about it in detail here:
http://code.openark.org/blog/mysql/sql-ranking-without-self-join
I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
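The bookkeeping in the query above can be easier to follow outside SQL. The following Python sketch mirrors the same single pass over the values in descending order, with `rank` counting the rows strictly greater than the current value and `ties` counting the rows seen so far that share it. The function name `percent_rank_desc` is made up for illustration.

```python
def percent_rank_desc(values):
    """Walk the non-null values in descending order and return
    (value, 1 - rank / total) pairs, mirroring the user-variable query."""
    ordered = sorted((v for v in values if v is not None), reverse=True)
    total = len(ordered)
    prev = None
    rank = 0  # rows strictly greater than the current value
    ties = 1  # rows seen so far that share the current value
    result = []
    for curr in ordered:
        if prev is not None and prev > curr:
            # A new, lower value: every row in the previous tie group outranks it.
            rank += ties
            ties = 1
        elif prev is not None and prev == curr:
            ties += 1
        result.append((curr, 1 - rank / total))
        prev = curr
    return result


if __name__ == "__main__":
    for value, score in percent_rank_desc([5, 3, 5, 1, None]):
        print(value, score)
```

Note that the score computed this way is the fraction of rows less than or equal to each value, so the largest value scores 1 and the smallest scores 1/total. That differs slightly from the (rank - 1) / (rows - 1) definition used by PERCENT_RANK().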
Upvotes: 20
Reputation: 4141
This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.
One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.
Create an index on your number. total = select count(*); less_equal = select count(*) where value <= indexed_number;
The percentage would be something like: less_equal / total or (total - less_equal)/total
Make sure that both queries use the index that you created. If they do not, tweak them until they do. The EXPLAIN output should show "Using index" in the right-hand column. In the case of the select count(*), it should be using the index for InnoDB and something like const for MyISAM. MyISAM knows this value at any time without having to calculate it.
If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.
Does this help?
Jacob
Upvotes: 3
Reputation:
To get the rank, I'd say you need to (left) outer join the table on itself, something like:
select t1.name, t1.value, count(distinct t2.value)
from mytable t1
left join mytable t2
on t1.value > t2.value
group by t1.name, t1.value
For each row, you will count how many (if any) rows of the same table have an inferior value.
Note that I'm more familiar with sqlserver so the syntax might not be right. Also the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then to get the real percentile rank you will need to first get the number of values in a variable (or distinct values depending on the convention you want to take) and compute the percentile rank using the real rank given above.
Upvotes: 0
Reputation: 4740
There is no easy way to do this. See http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html
Upvotes: 4